[00:38:06] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956015
[00:38:12] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956015 (owner: TrainBranchBot)
[00:42:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:43:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343198)', diff saved to https://phabricator.wikimedia.org/P52377 and previous config saved to /var/cache/conftool/dbconfig/20230911-004331-arnaudb.json
[00:43:35] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[00:44:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:47:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:51:52] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956015 (owner: TrainBranchBot)
[00:58:38] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P52378 and previous config saved to /var/cache/conftool/dbconfig/20230911-005837-arnaudb.json
[01:04:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[01:13:44] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P52379 and previous config saved to /var/cache/conftool/dbconfig/20230911-011343-arnaudb.json
[01:15:39] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:56] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[01:21:27] <wikibugs>	 (PS5) Andrew Bogott: wmf_sink: don't assume project_name == project_id [puppet] - https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158)
[01:21:29] <wikibugs>	 (PS1) Andrew Bogott: nova_fullstack_test: don't assume project_name == project_id [puppet] - https://gerrit.wikimedia.org/r/956088 (https://phabricator.wikimedia.org/T343158)
[01:21:55] <wikibugs>	 (CR) Andrew Bogott: "tested in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/956088 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott)
[01:28:50] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343198)', diff saved to https://phabricator.wikimedia.org/P52380 and previous config saved to /var/cache/conftool/dbconfig/20230911-012850-arnaudb.json
[01:28:52] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[01:28:54] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[01:29:05] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[01:29:11] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52381 and previous config saved to /var/cache/conftool/dbconfig/20230911-012911-arnaudb.json
[02:07:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:43:05] <icinga-wm>	 PROBLEM - Disk space on dbprov1004 is CRITICAL: DISK CRITICAL - free space: /srv 546743 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov1004&var-datasource=eqiad+prometheus/ops
[04:15:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:20:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:44:23] <icinga-wm>	 RECOVERY - Disk space on dbprov1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov1004&var-datasource=eqiad+prometheus/ops
[04:49:15] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2: Marostegui https://phabricator.wikimedia.org/T346012 https://wikitech.wikimedia.org/wiki/HAProxy
[04:49:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[04:59:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134', diff saved to https://phabricator.wikimedia.org/P52382 and previous config saved to /var/cache/conftool/dbconfig/20230911-045907-root.json
[05:00:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.mysql.clone of db1134.eqiad.wmnet onto db1128.eqiad.wmnet
[05:09:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:14:33] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix cloudbackup alias [puppet] - https://gerrit.wikimedia.org/r/955923 (owner: Muehlenhoff)
[05:15:56] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:31:57] <wikibugs>	 (CR) Muehlenhoff: "A few additional comments" [software/bitu] - https://gerrit.wikimedia.org/r/955160 (owner: Slyngshede)
[05:38:27] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1119 to dbctl [puppet] - https://gerrit.wikimedia.org/r/956100 (https://phabricator.wikimedia.org/T339185)
[05:38:56] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1119 to dbctl [puppet] - https://gerrit.wikimedia.org/r/956100 (https://phabricator.wikimedia.org/T339185) (owner: Marostegui)
[05:40:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1119 back to s1 depooled T339185', diff saved to https://phabricator.wikimedia.org/P52383 and previous config saved to /var/cache/conftool/dbconfig/20230911-054057-marostegui.json
[05:41:01] <stashbot>	 T339185: Test MariaDB + Debian bookworm on databases - https://phabricator.wikimedia.org/T339185
[06:11:54] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 136065
[06:12:34] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 136065
[06:26:33] <wikibugs>	 (CR) KartikMistry: [C: +1] Enable MinT translation service in more wikis - rollout #3 [mediawiki-config] - https://gerrit.wikimedia.org/r/956051 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[06:43:31] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (ayounsi) Unfortunately the errors are back; even though there aren't many, it's still better to fix the issue.
[06:50:59] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Use a single ensure for managing the nftables state [puppet] - https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[06:57:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1134.eqiad.wmnet onto db1128.eqiad.wmnet
[06:57:17] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Pass down the ensure to the requestctl settings [puppet] - https://gerrit.wikimedia.org/r/955865 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[06:58:48] <wikibugs>	 (PS1) Marostegui: Revert "db1128: Host crashed" [puppet] - https://gerrit.wikimedia.org/r/956054
[06:59:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:59:19] <wikibugs>	 (PS2) Kosta Harlan: Add ReportIncident extension [mediawiki-config] - https://gerrit.wikimedia.org/r/953998 (https://phabricator.wikimedia.org/T339275)
[06:59:24] <wikibugs>	 (PS2) Kosta Harlan: ReportIncident: Default deployment to false [mediawiki-config] - https://gerrit.wikimedia.org/r/953999 (https://phabricator.wikimedia.org/T339275)
[06:59:28] <wikibugs>	 (PS2) Kosta Harlan: [beta] ReportIncident: Enable on kowiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/955732 (https://phabricator.wikimedia.org/T339275)
[06:59:33] <wikibugs>	 (PS2) Kosta Harlan: [beta] Enable ReportIncident for configured beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/955735 (https://phabricator.wikimedia.org/T339275)
[06:59:38] <wikibugs>	 (PS2) Kosta Harlan: ReportIncident: Set default help page [mediawiki-config] - https://gerrit.wikimedia.org/r/955821 (https://phabricator.wikimedia.org/T343382)
[07:00:06] <jouncebot>	 Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T0700)
[07:00:06] <jouncebot>	 kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:19] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1128: Host crashed" [puppet] - https://gerrit.wikimedia.org/r/956054 (owner: Marostegui)
[07:00:38] <kostajh>	 good morning
[07:01:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 1%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52384 and previous config saved to /var/cache/conftool/dbconfig/20230911-070114-root.json
[07:01:18] <stashbot>	 T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509
[07:02:50] <kostajh>	 I've not deployed a change to wmf-config/extension-list before. Do I use `scap backport` for this? 
[07:02:56] <taavi>	 morning
[07:04:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:04:33] <taavi>	 yes, `scap backport` is fine
[07:06:06] <taavi>	 just make sure to do that separately from the patch enabling the extension
[07:06:06] <kostajh>	 ok
[07:06:29] <kostajh>	 taavi: does this stack of patches look OK to you? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/953998/
[07:06:29] <taavi>	 kostajh: do you need someone to deploy it for you or will you self-deploy? (sorry, I don't remember if you have the rights or not)
[07:06:41] <kostajh>	 I can self-deploy if the patches look ok
[07:06:52] <kostajh>	 the intended outcome is: extension disabled in production, and enabled in kowiki on betalabs
[07:08:18] <taavi>	 seems fine on a quick glance
[07:08:25] <kostajh>	 alright
[07:08:57] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/953998 (https://phabricator.wikimedia.org/T339275) (owner: Kosta Harlan)
[07:10:46] <wikibugs>	 (Merged) jenkins-bot: Add ReportIncident extension [mediawiki-config] - https://gerrit.wikimedia.org/r/953998 (https://phabricator.wikimedia.org/T339275) (owner: Kosta Harlan)
[07:11:17] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:953998|Add ReportIncident extension (T339275)]]
[07:11:21] <stashbot>	 T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275
[07:16:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 3%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52385 and previous config saved to /var/cache/conftool/dbconfig/20230911-071619-root.json
[07:16:23] <stashbot>	 T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509
[07:17:32] <wikibugs>	 (PS7) Slyngshede: Allow packing as a .deb [software/bitu] - https://gerrit.wikimedia.org/r/955160
[07:17:41] <wikibugs>	 (CR) Slyngshede: Allow packing as a .deb (20 comments) [software/bitu] - https://gerrit.wikimedia.org/r/955160 (owner: Slyngshede)
[07:22:42] <kostajh>	 (waiting for k8s image build/push to do its thing)
[07:23:57] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:953998|Add ReportIncident extension (T339275)]]
[07:23:59] <kostajh>	 trying again with `tmux`, as the connection hung up :\
[07:23:59] <stashbot>	 T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275
[07:27:23] <wikibugs>	 (CR) Muehlenhoff: Allow packing as a .deb (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/955160 (owner: Slyngshede)
[07:31:05] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/956086 (owner: Majavah)
[07:31:21] <wikibugs>	 (CR) Majavah: [C: +2] P:wmcs::metricsinfra: add missing trailing slash to url [puppet] - https://gerrit.wikimedia.org/r/956086 (owner: Majavah)
[07:31:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 5%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52386 and previous config saved to /var/cache/conftool/dbconfig/20230911-073124-root.json
[07:31:29] <stashbot>	 T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509
[07:31:38] <wikibugs>	 (PS8) Slyngshede: Allow packing as a .deb [software/bitu] - https://gerrit.wikimedia.org/r/955160
[07:32:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3005.esams.wmnet
[07:33:36] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Backport for [[gerrit:953998|Add ReportIncident extension (T339275)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:33:38] <stashbot>	 T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275
[07:35:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[07:35:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3005.esams.wmnet
[07:36:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Not sure if you are waiting for me on merging this, at any rate I'll go ahead and merge! HTH" [puppet] - https://gerrit.wikimedia.org/r/955924 (owner: Brouberol)
[07:36:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Grant permissions on icinga to user Brouberol [puppet] - https://gerrit.wikimedia.org/r/955924 (owner: Brouberol)
[07:36:24] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Continuing with sync
[07:41:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 1%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52387 and previous config saved to /var/cache/conftool/dbconfig/20230911-074116-root.json
[07:42:30] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Patch LGTM, thank you! I've cc'ed Ben for a heads-up: this change won't impact existing statsd metrics, and will make graphite failover " [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[07:43:20] <wikibugs>	 (PS9) Slyngshede: Allow packing as a .deb [software/bitu] - https://gerrit.wikimedia.org/r/955160
[07:43:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:43:35] <wikibugs>	 (CR) Slyngshede: Allow packing as a .deb (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/955160 (owner: Slyngshede)
[07:45:30] <wikibugs>	 SRE, Data-Platform-SRE, Observability-Metrics, superset.wikimedia.org: statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761 (fgiunchedi) The statsd-exporter part of this work is happening in {T345790} because we need to make graphite failovers simpler. Technically...
[07:46:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 10%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52388 and previous config saved to /var/cache/conftool/dbconfig/20230911-074629-root.json
[07:46:33] <stashbot>	 T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509
[07:46:41] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:953998|Add ReportIncident extension (T339275)]] (duration: 22m 44s)
[07:46:44] <stashbot>	 T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275
[07:48:08] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/953999 (https://phabricator.wikimedia.org/T339275) (owner: Kosta Harlan)
[07:48:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:48:36] <kostajh>	 taavi: `scap backport` is not really useful for beta cluster patches, is that correct? I can just +2 those myself via the gerrit UI? 
[07:48:49] <wikibugs>	 (Merged) jenkins-bot: ReportIncident: Default deployment to false [mediawiki-config] - https://gerrit.wikimedia.org/r/953999 (https://phabricator.wikimedia.org/T339275) (owner: Kosta Harlan)
[07:49:08] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:953999|ReportIncident: Default deployment to false (T339275)]]
[07:49:27] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Very nice!" [puppet] - https://gerrit.wikimedia.org/r/956071 (https://phabricator.wikimedia.org/T313030) (owner: Majavah)
[07:49:39] <taavi>	 kostajh: `scap backport` merges the patch and pulls it to the deployment server so the next deployer won't have an unexpected git state. you can do that manually too, yes
[07:49:44] <wikibugs>	 (PS10) Slyngshede: Allow packing as a .deb [software/bitu] - https://gerrit.wikimedia.org/r/955160
[07:49:51] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] icinga: drop toolforge.org cert monitor [puppet] - https://gerrit.wikimedia.org/r/956029 (owner: Majavah)
[07:50:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] icinga: drop tools.wmflabs.org monitoring [puppet] - https://gerrit.wikimedia.org/r/956072 (owner: Majavah)
[07:50:43] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Backport for [[gerrit:953999|ReportIncident: Default deployment to false (T339275)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:52:06] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] P:toolforge::checker: remove ToolsDB R/W check [puppet] - https://gerrit.wikimedia.org/r/956071 (https://phabricator.wikimedia.org/T313030) (owner: Majavah)
[07:52:19] <wikibugs>	 (CR) Majavah: [C: +2] icinga: drop toolforge.org cert monitor [puppet] - https://gerrit.wikimedia.org/r/956029 (owner: Majavah)
[07:52:35] <wikibugs>	 (CR) Majavah: [C: +2] icinga: drop tools.wmflabs.org monitoring [puppet] - https://gerrit.wikimedia.org/r/956072 (owner: Majavah)
[07:52:45] <wikibugs>	 (PS2) Majavah: icinga: drop tools.wmflabs.org monitoring [puppet] - https://gerrit.wikimedia.org/r/956072
[07:53:58] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Continuing with sync
[07:54:17] <kostajh>	 taavi: ack, thanks
[07:56:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 3%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52389 and previous config saved to /var/cache/conftool/dbconfig/20230911-075621-root.json
[07:58:12] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] citoid: update mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/955894 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[07:58:19] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] citoid: enable mesh tracing in staging [deployment-charts] - https://gerrit.wikimedia.org/r/955895 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[07:59:12] <logmsgbot>	 !log filippo@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[07:59:43] <logmsgbot>	 !log filippo@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[08:00:10] <logmsgbot>	 !log filippo@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[08:00:13] <logmsgbot>	 !log filippo@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[08:00:24] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:953999|ReportIncident: Default deployment to false (T339275)]] (duration: 11m 15s)
[08:00:32] <logmsgbot>	 !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[08:00:33] <stashbot>	 T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275
[08:00:45] <logmsgbot>	 !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[08:01:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 25%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52390 and previous config saved to /var/cache/conftool/dbconfig/20230911-080133-root.json
[08:01:37] <stashbot>	 T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509
[08:01:41] <logmsgbot>	 !log filippo@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[08:01:44] <wikibugs>	 (CR) Muehlenhoff: "Two more comments inline, which I had missed before" [software/bitu] - https://gerrit.wikimedia.org/r/955160 (owner: Slyngshede)
[08:02:06] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/955732 (https://phabricator.wikimedia.org/T339275) (owner: Kosta Harlan)
[08:02:07] <logmsgbot>	 !log filippo@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[08:02:08] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/955735 (https://phabricator.wikimedia.org/T339275) (owner: Kosta Harlan)
[08:02:10] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/955821 (https://phabricator.wikimedia.org/T343382) (owner: Kosta Harlan)
[08:02:26] <logmsgbot>	 !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[08:02:53] <logmsgbot>	 !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[08:02:59] <wikibugs>	 (Merged) jenkins-bot: [beta] ReportIncident: Enable on kowiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/955732 (https://phabricator.wikimedia.org/T339275) (owner: Kosta Harlan)
[08:03:25] <logmsgbot>	 !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply
[08:03:28] <wikibugs>	 (Merged) jenkins-bot: [beta] Enable ReportIncident for configured beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/955735 (https://phabricator.wikimedia.org/T339275) (owner: Kosta Harlan)
[08:03:31] <wikibugs>	 (Merged) jenkins-bot: ReportIncident: Set default help page [mediawiki-config] - https://gerrit.wikimedia.org/r/955821 (https://phabricator.wikimedia.org/T343382) (owner: Kosta Harlan)
[08:03:46] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:955732|[beta] ReportIncident: Enable on kowiki beta (T339275)]], [[gerrit:955735|[beta] Enable ReportIncident for configured beta wikis (T339275)]], [[gerrit:955821|ReportIncident: Set default help page (T343382)]]
[08:03:47] <logmsgbot>	 !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[08:03:51] <stashbot>	 T343382: Make link to code of conduct and wiki administrators page configurable per wiki - https://phabricator.wikimedia.org/T343382
[08:05:15] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Backport for [[gerrit:955732|[beta] ReportIncident: Enable on kowiki beta (T339275)]], [[gerrit:955735|[beta] Enable ReportIncident for configured beta wikis (T339275)]], [[gerrit:955821|ReportIncident: Set default help page (T343382)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[08:05:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3007.esams.wmnet
[08:06:48] <wikibugs>	 Puppet, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (fgiunchedi) Looks like the alert is working as expected: https://alerts.wikimedia.org/?q=%40sta...
[08:07:37] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Continuing with sync
[08:08:19] <wikibugs>	 (PS1) Tim Starling: Remove PHP 7.2 fallback for array_key_first() [mediawiki-config] - https://gerrit.wikimedia.org/r/956364
[08:08:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3007.esams.wmnet
[08:11:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 5%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52391 and previous config saved to /var/cache/conftool/dbconfig/20230911-081126-root.json
[08:13:31] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:955732|[beta] ReportIncident: Enable on kowiki beta (T339275)]], [[gerrit:955735|[beta] Enable ReportIncident for configured beta wikis (T339275)]], [[gerrit:955821|ReportIncident: Set default help page (T343382)]] (duration: 09m 44s)
[08:13:35] <stashbot>	 T343382: Make link to code of conduct and wiki administrators page configurable per wiki - https://phabricator.wikimedia.org/T343382
[08:13:35] <stashbot>	 T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275
[08:13:45] <kostajh>	 !log UTC morning deploys done
[08:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 50%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52392 and previous config saved to /var/cache/conftool/dbconfig/20230911-081638-root.json
[08:16:42] <stashbot>	 T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509
[08:17:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:20:26] <claime>	 !log rebooting mwdebug1002.eqiad.wmnet
[08:20:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:38] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwdebug1002.eqiad.wmnet
[08:20:56] <wikibugs>	 (PS11) Slyngshede: Allow packing as a .deb [software/bitu] - https://gerrit.wikimedia.org/r/955160
[08:21:39] <wikibugs>	 (CR) Slyngshede: Allow packing as a .deb (2 comments) [software/bitu] - https://gerrit.wikimedia.org/r/955160 (owner: Slyngshede)
[08:22:22] <wikibugs>	 Puppet: pg replication lag UNKNOWN for puppetdb2003 - https://phabricator.wikimedia.org/T346016 (fgiunchedi)
[08:22:48] <wikibugs>	 Puppet: pg replication lag UNKNOWN for puppetdb2003 - https://phabricator.wikimedia.org/T346016 (fgiunchedi)
[08:24:17] <wikibugs>	 (CR) JMeybohm: [C: +2] kubernetes::master: Validate SA tokens with the certs of all masters [puppet] - https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:25:03] <wikibugs>	 (CR) JMeybohm: [C: +2] kubernetes::master: Validate SA tokens with the certs of all masters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:25:04] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug1002.eqiad.wmnet
[08:26:15] <claime>	 !log rebooting mwdebug1001.eqiad.wmnet
[08:26:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:23] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwdebug1001.eqiad.wmnet
[08:26:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52393 and previous config saved to /var/cache/conftool/dbconfig/20230911-082631-root.json
[08:28:04] <wikibugs>	 (03PS12) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160
[08:28:15] <icinga-wm>	 PROBLEM - HTTPS Ganeti RAPI esams on ganeti3007 is CRITICAL: connect to address ganeti01.svc.esams.wmnet and port 5080: No route to host https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[08:31:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 75%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52394 and previous config saved to /var/cache/conftool/dbconfig/20230911-083143-root.json
[08:31:47] <stashbot>	 T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509
[08:32:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Ship it" [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede)
[08:32:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52395 and previous config saved to /var/cache/conftool/dbconfig/20230911-083258-arnaudb.json
[08:33:02] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[08:33:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1119 with Debian Bookworm in s1 with just 1% T339185', diff saved to https://phabricator.wikimedia.org/P52396 and previous config saved to /var/cache/conftool/dbconfig/20230911-083346-marostegui.json
[08:33:48] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug1001.eqiad.wmnet
[08:33:51] <stashbot>	 T339185: Test MariaDB + Debian bookworm on databases - https://phabricator.wikimedia.org/T339185
[08:34:33] <wikibugs>	 (03PS1) 10Marostegui: db1119: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/956366 (https://phabricator.wikimedia.org/T339185)
[08:37:13] <claime>	 !log rebooting mwmaint2002.codfw.wmnet
[08:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:19] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwmaint2002.codfw.wmnet
[08:40:28] <urbanecm>	 jouncebot: nowandnext
[08:40:28] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 19 minute(s)
[08:40:28] <jouncebot>	 In 1 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1000)
[08:40:59] <wikibugs>	 (03PS3) 10Urbanecm: Revert "Growth: Disable Add an image on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955049 (https://phabricator.wikimedia.org/T345188)
[08:41:04] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert "Growth: Disable Add an image on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955049 (https://phabricator.wikimedia.org/T345188) (owner: 10Urbanecm)
[08:41:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:41:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52397 and previous config saved to /var/cache/conftool/dbconfig/20230911-084135-root.json
[08:41:43] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Growth: Disable Add an image on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955049 (https://phabricator.wikimedia.org/T345188) (owner: 10Urbanecm)
[08:42:16] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:955049|Revert "Growth: Disable Add an image on all wikis" (T345188)]]
[08:42:18] <stashbot>	 T345188: Add Image: all wikis ran out of image recommendations - https://phabricator.wikimedia.org/T345188
[08:42:35] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams01_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:13] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:44:32] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:955049|Revert "Growth: Disable Add an image on all wikis" (T345188)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[08:45:27] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwmaint2002.codfw.wmnet
[08:45:40] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[08:46:17] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Continuing with sync
[08:46:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:46:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 100%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52398 and previous config saved to /var/cache/conftool/dbconfig/20230911-084647-root.json
[08:46:51] <stashbot>	 T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509
[08:48:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P52399 and previous config saved to /var/cache/conftool/dbconfig/20230911-084804-arnaudb.json
[08:48:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/955916 (owner: 10Muehlenhoff)
[08:51:00] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[08:51:24] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[08:51:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52400 and previous config saved to /var/cache/conftool/dbconfig/20230911-085129-arnaudb.json
[08:51:33] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[08:51:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) In terms of the LVS connections from rows C and D, when we move from old switches to new ones we need to land those on the Spines rather t...
[08:52:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:52:43] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:955049|Revert "Growth: Disable Add an image on all wikis" (T345188)]] (duration: 10m 27s)
[08:52:47] <stashbot>	 T345188: Add Image: all wikis ran out of image recommendations - https://phabricator.wikimedia.org/T345188
[08:52:50] * urbanecm done
[08:54:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[08:54:48] <jinxer-wm>	 (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:56:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52401 and previous config saved to /var/cache/conftool/dbconfig/20230911-085640-root.json
[08:59:48] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:03:11] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P52402 and previous config saved to /var/cache/conftool/dbconfig/20230911-090310-arnaudb.json
[09:05:13] <wikibugs>	 (03PS1) 10Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/956367
[09:08:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:10:46] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956367 (owner: 10Muehlenhoff)
[09:11:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52403 and previous config saved to /var/cache/conftool/dbconfig/20230911-091145-root.json
[09:11:53] <wikibugs>	 (03PS1) 10Jbond: firewall: move requestctl logic outside of the ferm block [puppet] - 10https://gerrit.wikimedia.org/r/956368
[09:14:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[09:18:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52404 and previous config saved to /var/cache/conftool/dbconfig/20230911-091817-arnaudb.json
[09:18:19] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[09:18:20] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[09:18:32] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[09:18:32] <claime>	 jouncebot: nowandnext
[09:18:32] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 41 minute(s)
[09:18:32] <jouncebot>	 In 0 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1000)
[09:18:41] <claime>	 !log rebooting deploy2002.codfw.wmnet
[09:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:48] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host deploy2002.codfw.wmnet
[09:19:07] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1119: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/956366 (https://phabricator.wikimedia.org/T339185) (owner: 10Marostegui)
[09:22:04] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10jijiki)
[09:22:09] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10jijiki) @Jhancock.wm  I am afraid the server is dead again :(
[09:24:20] <wikibugs>	 (03PS2) 10Jbond: firewall: move requestctl logic outside of the ferm block [puppet] - 10https://gerrit.wikimedia.org/r/956368
[09:24:21] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: T342361 - testing blazegraph startup script refactor
[09:24:24] <stashbot>	 T342361: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361
[09:24:34] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: T342361 - testing blazegraph startup script refactor
[09:25:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:26:45] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2002.codfw.wmnet
[09:26:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52405 and previous config saved to /var/cache/conftool/dbconfig/20230911-092650-root.json
[09:29:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/956368 (owner: 10Jbond)
[09:29:39] <jinxer-wm>	 (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:30:27] <claime>	 ^That's my bad
[09:32:37] <claime>	 !log rearmed keyholder on deploy2002.codfw.wmnet
[09:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:31] <wikibugs>	 (03PS1) 10Hnowlan: jobqueue: limit thumbnailrender job concurrency further [deployment-charts] - 10https://gerrit.wikimedia.org/r/956370 (https://phabricator.wikimedia.org/T337649)
[09:33:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] firewall: move requestctl logic outside of the ferm block [puppet] - 10https://gerrit.wikimedia.org/r/956368 (owner: 10Jbond)
[09:34:39] <jinxer-wm>	 (KeyholderUnarmed) resolved: 18 unarmed Keyholder key(s) on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:35:54] <wikibugs>	 (03PS2) 10Hnowlan: jobqueue, thumbor: attempt to limit impact of thumbnailrender job [deployment-charts] - 10https://gerrit.wikimedia.org/r/956370 (https://phabricator.wikimedia.org/T337649)
[09:38:24] <wikibugs>	 (03PS1) 10Jbond: firewall: only create stub file in the present changes [puppet] - 10https://gerrit.wikimedia.org/r/956371
[09:38:41] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ores-extension: enable lw in enwiki and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956372 (https://phabricator.wikimedia.org/T342115)
[09:40:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/956371 (owner: 10Jbond)
[09:40:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] firewall: only create stub file in the present changes [puppet] - 10https://gerrit.wikimedia.org/r/956371 (owner: 10Jbond)
[09:42:58] <wikibugs>	 (03PS2) 10Jbond: firewall: only create stub file in the present changes [puppet] - 10https://gerrit.wikimedia.org/r/956371
[09:43:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] firewall: only create stub file in the present changes [puppet] - 10https://gerrit.wikimedia.org/r/956371 (owner: 10Jbond)
[09:43:39] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "wow!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956372 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos)
[09:43:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1004.wikimedia.org
[09:48:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1004.wikimedia.org
[09:48:48] <wikibugs>	 (03PS1) 10Elukey: profile::service_proxy::envoy: set use_ingress for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/956373 (https://phabricator.wikimedia.org/T339890)
[09:50:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::service_proxy::envoy: set use_ingress for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/956373 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey)
[09:51:59] <wikibugs>	 (03PS1) 10Btullis: Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374
[09:52:35] <wikibugs>	 (03PS1) 10Jbond: ferm: Add force true to force dir removal [puppet] - 10https://gerrit.wikimedia.org/r/956375
[09:53:08] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:53:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2003.wikimedia.org
[09:55:52] <wikibugs>	 (03Abandoned) 10Jbond: puppetdb: migrate check to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955928 (https://phabricator.wikimedia.org/T345909) (owner: 10Jbond)
[09:56:03] <wikibugs>	 (03Abandoned) 10Jbond: check_puppet_run_changes: update to run on puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955939 (owner: 10Jbond)
[09:56:30] <wikibugs>	 (03PS2) 10Btullis: Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374
[09:57:54] <wikibugs>	 (03PS3) 10Btullis: Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910)
[09:57:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2003.wikimedia.org
[09:59:49] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10brouberol) Actually, after having experimented with supporting both OpenSearch and Elasticsearch in spicerack with local experiments, we've decided to put a pin...
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1000)
[10:03:40] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab1003.wikimedia.org with OS bullseye
[10:07:15] <wikibugs>	 (03PS1) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T339890)
[10:08:12] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:08:32] <wikibugs>	 (03PS1) 10Elukey: ml-services: update Docker image for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/956380 (https://phabricator.wikimedia.org/T339890)
[10:09:47] <Amir1>	 jouncebot: nowandnext
[10:09:47] <jouncebot>	 For the next 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1000)
[10:09:47] <jouncebot>	 In 2 hour(s) and 50 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1300)
[10:10:33] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:11:20] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/956380 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey)
[10:11:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update Docker image for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/956380 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey)
[10:11:36] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams01_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:05] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update readability model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/956381
[10:14:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[10:15:32] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:16:02] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage
[10:18:29] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage
[10:21:12] <wikibugs>	 (03CR) 10AikoChou: [C: 03+2] ml-services: update readability model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/956381 (owner: 10AikoChou)
[10:21:54] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update readability model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/956381 (owner: 10AikoChou)
[10:22:15] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10brouberol) 05Open→03Declined
[10:22:42] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10Volans) @brouberol thanks for the summary and update! Curator is the dependency that mostly creates issues and I think it would be great if we will plan for a pa...
[10:24:23] <wikibugs>	 (03PS1) 10Btullis: Retain python2 on the test hadoop standby role [puppet] - 10https://gerrit.wikimedia.org/r/956383 (https://phabricator.wikimedia.org/T329363)
[10:25:14] <wikibugs>	 (03CR) 10Nikerabbit: [C: 03+1] Enable MinT translation service in more wikis - rollout #3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956051 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro)
[10:25:37] <wikibugs>	 (03CR) 10Nikerabbit: [C: 03+1] Disable Special:Contribute on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956078 (https://phabricator.wikimedia.org/T345772) (owner: 10KartikMistry)
[10:26:10] <wikibugs>	 (03CR) 10Nikerabbit: [C: 03+1] Add Akan language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: 10Srishakatux)
[10:26:19] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43194/console" [puppet] - 10https://gerrit.wikimedia.org/r/956383 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:27:52] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:30:32] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:34:48] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10MatthewVernon) Thanks for this; I agree that we should probably (virtually) sit down and talk about this; I wanted to try and make sure we had most of the o...
[10:38:18] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:39:09] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1003.wikimedia.org with OS bullseye
[10:42:34] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams01_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:48] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v7.2.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/956384
[10:42:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/956375 (owner: 10Jbond)
[10:43:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/956383 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:43:49] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Retain python2 on the test hadoop standby role [puppet] - 10https://gerrit.wikimedia.org/r/956383 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:44:46] <wikibugs>	 (03PS1) 10Elukey: ml-services: add REQUESTS_CA_BUNDLE env var to rec-api-ng's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/956385 (https://phabricator.wikimedia.org/T339890)
[10:46:54] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v7.2.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/956384 (owner: 10Volans)
[10:48:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:49:01] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/956385 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey)
[10:50:27] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, I think though that this will need a manual cleanup of the existing checkout." [puppet] - 10https://gerrit.wikimedia.org/r/955937 (https://phabricator.wikimedia.org/T343894) (owner: 10FNegri)
[10:50:54] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v7.2.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/956384 (owner: 10Volans)
[10:51:41] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add REQUESTS_CA_BUNDLE env var to rec-api-ng's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/956385 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey)
[10:54:26] <wikibugs>	 (03CR) 10Volans: "reply inline" [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi)
[10:55:21] <wikibugs>	 (03PS1) 10Volans: Upstream release v7.2.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/956386
[10:55:30] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] Upstream release v7.2.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/956386 (owner: 10Volans)
[10:56:35] <wikibugs>	 (03CR) 10FNegri: [V: 03+1 C: 03+2] [cluster::cloud_management] Don't install prod cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/955937 (https://phabricator.wikimedia.org/T343894) (owner: 10FNegri)
[10:57:14] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/948203 (owner: 10Amire80)
[10:59:13] <volans>	 !log uploaded spicerack_7.2.2 to apt.wikimedia.org bullseye-wikimedia
[10:59:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:00] <wikibugs>	 (03PS1) 10Clément Goubert: mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780)
[11:03:25] <wikibugs>	 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10fnegri) 05Open→03Declined I merged the patch above and cleaned up the SRE cookbooks from cloudcumin[1-2]...
[11:05:00] <wikibugs>	 (03PS1) 10Clément Goubert: mw-on-k8s: Raise traffic to 5% [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780)
[11:05:29] <wikibugs>	 (03PS2) 10Clément Goubert: mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780)
[11:05:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mw-on-k8s: Raise traffic to 5% [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[11:06:14] <volans>	 !log installed spicerack v7.2.2 on both cumin hosts
[11:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:17] <volans>	 XioNoX: ^^^
[11:06:22] <volans>	 all yours
[11:06:27] <XioNoX>	 thanks!
[11:06:34] <XioNoX>	 I'll give it a try later on
[11:06:40] <wikibugs>	 (03CR) 10Clément Goubert: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[11:06:55] <isaranto>	 Heads up! I'm going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/956372/ with Amir1:
[11:07:51] <TheresNoTime>	 gl!
[11:07:58] <icinga-wm>	 RECOVERY - HTTPS Ganeti RAPI esams on ganeti3007 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.015 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[11:08:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by isaranto@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956372 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos)
[11:08:18] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:34] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:08:50] <wikibugs>	 (03Merged) 10jenkins-bot: ores-extension: enable lw in enwiki and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956372 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos)
[11:09:06] <logmsgbot>	 !log isaranto@deploy1002 Started scap: Backport for [[gerrit:956372|ores-extension: enable lw in enwiki and wikidata (T342115)]]
[11:09:09] <stashbot>	 T342115: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115
[11:10:39] <logmsgbot>	 !log isaranto@deploy1002 isaranto: Backport for [[gerrit:956372|ores-extension: enable lw in enwiki and wikidata (T342115)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[11:13:33] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:13:47] <wikibugs>	 (03PS4) 10Winston Sung: Add Akan language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: 10Srishakatux)
[11:19:29] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove debian::codename::require::min() checks for Buster [puppet] - 10https://gerrit.wikimedia.org/r/955909
[11:20:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org
[11:23:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org
[11:26:36] <claime>	 !log Rebooting poolcounter2004.codfw.wmnet
[11:26:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:40] <logmsgbot>	 !log isaranto@deploy1002 isaranto: Continuing with sync
[11:26:41] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host poolcounter2004.codfw.wmnet
[11:28:06] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[11:28:22] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] mw-on-k8s: Raise traffic to 5% [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[11:30:57] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2004.codfw.wmnet
[11:31:49] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Puppet-Infrastructure: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10jbond) 05In progress→03Resolved a:03jbond >>! In T345909#9155302, @fgiunchedi wrote: > Looks like the alert is...
[11:31:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] ferm: Add force true to force dir removal [puppet] - 10https://gerrit.wikimedia.org/r/956375 (owner: 10Jbond)
[11:32:53] <logmsgbot>	 !log isaranto@deploy1002 Finished scap: Backport for [[gerrit:956372|ores-extension: enable lw in enwiki and wikidata (T342115)]] (duration: 23m 46s)
[11:32:56] <stashbot>	 T342115: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115
[11:35:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet
[11:37:54] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LGTM, thanks for fixing kafka as well!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 (owner: 10Ebernhardson)
[11:39:35] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] flink-app: Allow declaring zookeeper clusters by name [deployment-charts] - 10https://gerrit.wikimedia.org/r/955033 (owner: 10Ebernhardson)
[11:41:58] <claime>	 !log Rebooting poolcounter2003.codfw.wmnet
[11:42:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:05] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host poolcounter2003.codfw.wmnet
[11:42:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet
[11:42:45] <Amir1>	 !log setting binlog format to STATEMENT in x1 eqiad and codfw masters (T337310)
[11:42:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:49] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[11:43:48] <wikibugs>	 (03PS1) 10Muehlenhoff: ferm: Move more files under the service check conditional [puppet] - 10https://gerrit.wikimedia.org/r/956410 (https://phabricator.wikimedia.org/T336497)
[11:45:19] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] Enable PageNotice on enwiktionary beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: 10TTO)
[11:45:54] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2003.codfw.wmnet
[11:46:11] <wikibugs>	 (03PS1) 10Ayounsi: Routinator: tmpfs, bump the maximum number of inodes [puppet] - 10https://gerrit.wikimedia.org/r/956411 (https://phabricator.wikimedia.org/T300955)
[11:51:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2014.codfw.wmnet to cluster codfw and group C
[11:51:20] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2014.codfw.wmnet to cluster codfw and group C
[11:51:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "While I am not the best person to weigh on the spark front, the version split approach seems fine. However, there is the caveat, that you " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[11:52:53] <wikibugs>	 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) 05Resolved→03Open I wanted to re-add the node to the ganeti cluster, but it seems after the mainboard replacement virtualisation is no longer enabled in BIOS, can you please enable that?
[11:53:15] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956410 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:54:24] <wikibugs>	 (03PS1) 10Ladsgroup: Add drop_notification_seen_T337310.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/956412 (https://phabricator.wikimedia.org/T337310)
[11:59:49] <wikibugs>	 (03CR) 10Btullis: Refactor spark support to build multiple minor versions (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[12:00:15] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Add drop_notification_seen_T337310.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/956412 (https://phabricator.wikimedia.org/T337310) (owner: 10Ladsgroup)
[12:03:05] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (10Aklapper) Hi (and welcome)! The Phabricator account @Ahoelzl is currently connected to a [personal MediaWiki account](https://phabricator.wikimedia.org/p/Ahoelzl/) and not t...
[12:04:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/956411 (https://phabricator.wikimedia.org/T300955) (owner: 10Ayounsi)
[12:06:22] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956414 (https://phabricator.wikimedia.org/T329826)
[12:07:38] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43195/console" [puppet] - 10https://gerrit.wikimedia.org/r/956414 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[12:08:00] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: drop cloudservices1004 addresses [dns] - 10https://gerrit.wikimedia.org/r/956415 (https://phabricator.wikimedia.org/T342621)
[12:08:07] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "let me know if you want me to merge this" [puppet] - 10https://gerrit.wikimedia.org/r/948203 (owner: 10Amire80)
[12:09:45] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956414 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[12:11:15] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1054.eqiad.wmnet
[12:11:28] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2054.codfw.wmnet
[12:13:44] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove python3-build-jessie (Jessie is EOL) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/941442 (owner: 10Hashar)
[12:14:48] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: refresh cloudservices1006 ns address [puppet] - 10https://gerrit.wikimedia.org/r/956417 (https://phabricator.wikimedia.org/T342621)
[12:15:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org
[12:17:34] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1054.eqiad.wmnet
[12:17:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/956411 (https://phabricator.wikimedia.org/T300955) (owner: 10Ayounsi)
[12:18:08] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2054.codfw.wmnet
[12:18:39] <wikibugs>	 (03Abandoned) 10Hashar: ci: enabling docker requires the docker-ce package [puppet] - 10https://gerrit.wikimedia.org/r/935471 (https://phabricator.wikimedia.org/T341051) (owner: 10Hashar)
[12:18:43] <wikibugs>	 (03Abandoned) 10Hashar: ci: setup dockervolume before Docker daemon [puppet] - 10https://gerrit.wikimedia.org/r/935405 (https://phabricator.wikimedia.org/T341051) (owner: 10Hashar)
[12:18:45] <moritzm>	 !log installing libssh2 security updates
[12:18:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:21:51] <moritzm>	 !log restarting apache/FPM on mediawiki canaries
[12:21:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org
[12:23:36] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: increase the recommendation-api-ng memory usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/956017 (https://phabricator.wikimedia.org/T339890)
[12:25:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1137.eqiad.wmnet with reason: Maintenance
[12:25:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:25:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1137.eqiad.wmnet with reason: Maintenance
[12:25:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1137 (T337310)', diff saved to https://phabricator.wikimedia.org/P52408 and previous config saved to /var/cache/conftool/dbconfig/20230911-122535-ladsgroup.json
[12:25:39] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[12:25:39] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:26:55] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: refresh DNS addresses [puppet] - 10https://gerrit.wikimedia.org/r/956419 (https://phabricator.wikimedia.org/T342621)
[12:27:19] <taavi>	 jouncebot: nowandnext
[12:27:19] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 32 minute(s)
[12:27:19] <jouncebot>	 In 0 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1300)
[12:29:44] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] openstack: refresh cloudservices1006 ns address [puppet] - 10https://gerrit.wikimedia.org/r/956417 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez)
[12:30:32] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[12:31:57] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] wmcs: refresh DNS addresses [puppet] - 10https://gerrit.wikimedia.org/r/956419 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez)
[12:31:59] <wikibugs>	 (03PS13) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160
[12:32:08] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: refresh cloudservices1006 ns address [puppet] - 10https://gerrit.wikimedia.org/r/956417 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez)
[12:32:12] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/956415 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez)
[12:32:16] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: refresh DNS addresses [puppet] - 10https://gerrit.wikimedia.org/r/956419 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez)
[12:33:32] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: drop cloudservices1004 addresses [dns] - 10https://gerrit.wikimedia.org/r/956415 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez)
[12:35:52] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Add drop_notification_seen_T337310.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/956412 (https://phabricator.wikimedia.org/T337310) (owner: 10Ladsgroup)
[12:36:30] <wikibugs>	 (03PS7) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027)
[12:36:48] <wikibugs>	 (03Merged) 10jenkins-bot: Add drop_notification_seen_T337310.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/956412 (https://phabricator.wikimedia.org/T337310) (owner: 10Ladsgroup)
[12:37:03] <wikibugs>	 (03CR) 10AOkoth: vrts: apply role and setup hiera values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[12:37:07] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1006 - aborrero@cumin1001"
[12:37:57] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1006 - aborrero@cumin1001"
[12:37:58] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:39:45] <wikibugs>	 (03PS1) 10Jelto: gitlab: use UUID in provision filesystem script [puppet] - 10https://gerrit.wikimedia.org/r/956422
[12:39:48] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: eqiad1: drop ns1-next and use ns1 [puppet] - 10https://gerrit.wikimedia.org/r/956423 (https://phabricator.wikimedia.org/T345240)
[12:40:28] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: eqiad1: drop ns1-next and use ns1 [puppet] - 10https://gerrit.wikimedia.org/r/956423 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[12:52:12] <wikibugs>	 (03PS1) 10Majavah: cr-cloud: add ns-recursor.openstack.eqiad1 [homer/public] - 10https://gerrit.wikimedia.org/r/956429 (https://phabricator.wikimedia.org/T342621)
[12:53:30] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cr-cloud: add ns-recursor.openstack.eqiad1 [homer/public] - 10https://gerrit.wikimedia.org/r/956429 (https://phabricator.wikimedia.org/T342621) (owner: 10Majavah)
[12:59:27] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2055.codfw.wmnet
[12:59:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[12:59:49] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1037.eqiad.wmnet
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1300).
[13:00:04] <jouncebot>	 Func, kart_, and abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] <Func>	 o/
[13:00:18] <kart_>	 \0
[13:01:00] <kart_>	 I will also deploy abijeet's patch.
[13:01:14] <Lucas_WMDE>	 ok!
[13:01:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:01:53] <wikibugs>	 10Puppet: pg replication lag UNKNOWN for puppetdb2003 - https://phabricator.wikimedia.org/T346016 (10jbond) 05Open→03In progress p:05Triage→03Medium
[13:01:59] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] noc: remove from role::maintenance [puppet] - 10https://gerrit.wikimedia.org/r/944920 (owner: 10Giuseppe Lavagetto)
[13:02:06] <kart_>	 Func: You can start with your patch.
[13:02:25] <Func>	 I am not a deployer ;)
[13:04:20] <taavi>	 i am semi-around but busy with a wmcs issue, sorry
[13:04:24] <kart_>	 ah. I have no idea about the patch. Can anyone else deploy it?
[13:04:37] <kart_>	 Lucas_WMDE: ^^
[13:05:01] <Lucas_WMDE>	 I’d prefer not to deploy, but let me take a look
[13:05:57] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2055.codfw.wmnet
[13:06:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:06:25] <Lucas_WMDE>	 ok let’s try it I guess
[13:06:27] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1037.eqiad.wmnet
[13:06:35] <wikibugs>	 (03PS1) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/956432 (https://phabricator.wikimedia.org/T342361)
[13:06:38] <Lucas_WMDE>	 Func: if the logspam recurs I assume it’ll be noticeable quickly and we can safely revert again?
[13:06:50] <Func>	 yeah
[13:07:13] <Lucas_WMDE>	 ok, good enough for me
[13:07:14] <TheresNoTime>	 (fwiw the internal DNS issues are at the very least causing beta cluster deploys to fail, so ymmv)
[13:07:18] <RhinosF1>	 Lucas_WMDE: as an fyi, beta CI is broken
[13:07:23] <Lucas_WMDE>	 ok
[13:07:29] <Lucas_WMDE>	 but it’s not expected to affect production right?
[13:07:29] <RhinosF1>	 I'm not sure how normal CI works
[13:07:42] <Lucas_WMDE>	 let’s try it out
[13:07:47] <Lucas_WMDE>	 if it fails I’ll know why, thanks
[13:07:56] <RhinosF1>	 Lucas_WMDE: no production impact but if CI throws weird errors, it's known
[13:08:02] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956050 (https://phabricator.wikimedia.org/T340697) (owner: 10Func)
[13:08:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956050 (https://phabricator.wikimedia.org/T340697) (owner: 10Func)
[13:08:51] <wikibugs>	 (03Merged) 10jenkins-bot: Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956050 (https://phabricator.wikimedia.org/T340697) (owner: 10Func)
[13:09:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:956050|Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" (T340697)]]
[13:09:18] <stashbot>	 T340697: Remove $wgNamespacesWithSubpages overrides for the MediaWiki namespace in production - https://phabricator.wikimedia.org/T340697
[13:09:18] <wikibugs>	 (03PS3) 10KartikMistry: Enable MinT translation service in more wikis - rollout #3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956051 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro)
[13:10:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T337310)', diff saved to https://phabricator.wikimedia.org/P52409 and previous config saved to /var/cache/conftool/dbconfig/20230911-131001-ladsgroup.json
[13:10:08] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[13:10:38] <Lucas_WMDE>	 Func: I’m confused by some of the diffConfig output https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/4260/console
[13:10:43] <Lucas_WMDE>	 there seem to be a lot of "14"s affected
[13:11:01] <Lucas_WMDE>	 and also e.g. conf-production-zh_yuewiki.json has "16" removed too
[13:11:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 func and lucaswerkmeister-wmde: Backport for [[gerrit:956050|Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" (T340697)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:11:35] <Lucas_WMDE>	 is this really correct?
[13:11:44] <Lucas_WMDE>	 (I should’ve checked this before +2ing, really)
[13:11:46] <Func>	 eh, let me check
[13:12:21] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] makevm: handle sandbox vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/955730 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi)
[13:13:20] <Lucas_WMDE>	 in `mwscript shell zuwikibooks`, `$namespaceInfo->hasSubpages(14)` returns false on mwdebug1002
[13:13:39] <Lucas_WMDE>	 but true on mwmaint1002
[13:13:58] <Lucas_WMDE>	 (14 being NS_CATEGORY)
[13:14:44] <wikibugs>	 (03Merged) 10jenkins-bot: makevm: handle sandbox vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/955730 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi)
[13:15:27] <Func>	 Lucas_WMDE: eh, I don't know how that is possible, maybe we don't deploy this time
[13:15:31] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: noc: remove profile, module [puppet] - 10https://gerrit.wikimedia.org/r/944921
[13:16:06] <Lucas_WMDE>	 Func: yeah, I don’t understand it either
[13:16:14] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host atlas3001.wikimedia.org
[13:16:15] <Lucas_WMDE>	 I’ll say `n` to scap backport
[13:16:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[13:16:22] <Lucas_WMDE>	 and find out if it reverts itself or if I need to upload a revert manually ^^
[13:16:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Sync cancelled.
[13:16:33] <Lucas_WMDE>	 ok, it just cancels the sync
[13:16:35] * Lucas_WMDE reverts
[13:16:41] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] noc: remove profile, module [puppet] - 10https://gerrit.wikimedia.org/r/944921 (owner: 10Giuseppe Lavagetto)
[13:17:05] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Revert "Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956057
[13:17:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956057 (owner: 10Lucas Werkmeister (WMDE))
[13:18:00] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "(I should have checked the diffConfig before merging that other change, my bad. It shouldn’t have been merged at all, then this revert wou" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956057 (owner: 10Lucas Werkmeister (WMDE))
[13:18:02] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956057 (owner: 10Lucas Werkmeister (WMDE))
[13:18:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:956057|Revert "Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace""]]
[13:18:25] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1001"
[13:19:11] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1001"
[13:19:12] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:19:12] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.wipe-cache atlas3001.wikimedia.org on all recursors
[13:19:15] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas3001.wikimedia.org on all recursors
[13:19:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[13:19:42] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas3001.wikimedia.org - ayounsi@cumin1001"
[13:19:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:956057|Revert "Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace""]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:20:07] <Lucas_WMDE>	 I’ll let this sync go through so I’m sure everything is on the same page
[13:20:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync
[13:20:15] <Lucas_WMDE>	 (though it shouldn’t be necessary, strictly speaking)
[13:20:22] <kart_>	 I wish scap backport would always say Y :)
[13:20:30] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas3001.wikimedia.org - ayounsi@cumin1001"
[13:20:30] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host atlas3001.wikimedia.org
[13:20:42] <icinga-wm>	 PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100%
[13:20:56] <kart_>	 Lucas_WMDE: let me know when scap is done.
[13:21:21] <Lucas_WMDE>	 will do
[13:22:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119', diff saved to https://phabricator.wikimedia.org/P52411 and previous config saved to /var/cache/conftool/dbconfig/20230911-132210-root.json
[13:24:02] <wikibugs>	 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10Jhancock.wm) 05Open→03Resolved @MoritzMuehlenhoff it's enabled now.
[13:24:22] <icinga-wm>	 RECOVERY - Host ganeti2014 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms
[13:25:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P52412 and previous config saved to /var/cache/conftool/dbconfig/20230911-132507-ladsgroup.json
[13:26:17] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-wf2001.codfw.wmnet
[13:26:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:956057|Revert "Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace""]] (duration: 08m 04s)
[13:26:45] <kart_>	 OK. It seems done.
[13:27:24] <kart_>	 I'll deploy abijeet's patch. Skipping my patch.
[13:27:41] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43201/console" [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[13:27:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956051 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro)
[13:28:16] * Lucas_WMDE done
[13:28:18] <Lucas_WMDE>	 kart_: go ahead
[13:28:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/956410 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[13:28:28] <Lucas_WMDE>	 (sorry, I got distracted for a minute)
[13:28:30] <wikibugs>	 (03Merged) 10jenkins-bot: Enable MinT translation service in more wikis - rollout #3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956051 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro)
[13:28:33] <abijeet>	 kart_, thanks!
[13:28:48] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:956051|Enable MinT translation service in more wikis - rollout #3 (T341445)]]
[13:28:51] <stashbot>	 T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445
[13:29:35] <kart_>	 Lucas_WMDE: No problem!
[13:30:13] <logmsgbot>	 !log kartik@deploy1002 kartik and abi: Backport for [[gerrit:956051|Enable MinT translation service in more wikis - rollout #3 (T341445)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:30:34] <kart_>	 abijeet: can you test the patch using mwdebug?
[13:30:46] <abijeet>	 kart_, sure.
[13:31:29] <kart_>	 Let me know if everything is OK
[13:32:14] <wikibugs>	 (03CR) 10LSobanski: [C: 03+1] gitlab: use UUID in provision filesystem script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956422 (owner: 10Jelto)
[13:32:14] <abijeet>	 kart_, looks good.
[13:32:45] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2001.codfw.wmnet
[13:33:03] <kart_>	 cool. Going ahead.
[13:33:42] <wikibugs>	 (03PS1) 10Func: [WIP] Re-reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956058
[13:33:44] <logmsgbot>	 !log kartik@deploy1002 kartik and abi: Continuing with sync
[13:33:50] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder)
[13:34:11] <wikibugs>	 (03PS2) 10Func: [WIP] Re-reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956058
[13:36:13] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+1] "looks mostly good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[13:36:46] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet
[13:40:07] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:956051|Enable MinT translation service in more wikis - rollout #3 (T341445)]] (duration: 11m 18s)
[13:40:10] <stashbot>	 T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445
[13:40:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P52413 and previous config saved to /var/cache/conftool/dbconfig/20230911-134013-ladsgroup.json
[13:40:55] <kart_>	 We are done now, abijeet :)
[13:41:44] <abijeet>	 kart_, thanks
[13:43:15] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet
[13:43:18] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-wf1001.eqiad.wmnet
[13:43:20] <icinga-wm>	 RECOVERY - Host mw2444 is UP: PING OK - Packet loss = 0%, RTA = 33.58 ms
[13:43:24] <icinga-wm>	 PROBLEM - Check systemd state on mw2444 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:43:26] <icinga-wm>	 PROBLEM - puppet last run on mw2444 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:44:32] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10Jhancock.wm) I opened a Dell support ticket to get a replacement. I've rebooted it for now but expect it to go down again.  SR: 175669963
[13:44:52] <icinga-wm>	 RECOVERY - Check systemd state on mw2444 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:37] <wikibugs>	 (03PS2) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T339890)
[13:48:52] <icinga-wm>	 RECOVERY - puppet last run on mw2444 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:49:44] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1001.eqiad.wmnet
[13:49:47] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-wf1002.eqiad.wmnet
[13:51:10] <wikibugs>	 (03PS1) 10Ayounsi: Add atlas_group (VMs) to RIPE atlas policy [homer/public] - 10https://gerrit.wikimedia.org/r/956435 (https://phabricator.wikimedia.org/T307021)
[13:51:26] <wikibugs>	 (03PS2) 10Jelto: gitlab: use UUID in provision filesystem script [puppet] - 10https://gerrit.wikimedia.org/r/956422
[13:52:38] <wikibugs>	 (03CR) 10Jelto: gitlab: use UUID in provision filesystem script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956422 (owner: 10Jelto)
[13:54:46] <wikibugs>	 (03CR) 10LSobanski: gitlab: use UUID in provision filesystem script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956422 (owner: 10Jelto)
[13:55:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T337310)', diff saved to https://phabricator.wikimedia.org/P52414 and previous config saved to /var/cache/conftool/dbconfig/20230911-135520-ladsgroup.json
[13:55:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[13:55:24] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[13:55:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[13:55:36] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1002.eqiad.wmnet
[13:55:39] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-wf2002.codfw.wmnet
[13:56:04] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] gitlab: use UUID in provision filesystem script [puppet] - 10https://gerrit.wikimedia.org/r/956422 (owner: 10Jelto)
[13:56:20] <wikibugs>	 (03PS1) 10Majavah: Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - 10https://gerrit.wikimedia.org/r/956439
[13:57:07] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM, did not test it though, let me know if you want a thorough test" [puppet] - 10https://gerrit.wikimedia.org/r/956088 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott)
[13:59:05] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2002.codfw.wmnet
[13:59:56] <wikibugs>	 (03CR) 10Herron: [C: 03+1] superset: Move superset metrics to statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: 10Andrea Denisse)
[14:02:33] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Fab! Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: 10Andrea Denisse)
[14:05:23] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott)
[14:06:12] <wikibugs>	 (03PS1) 10Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956440
[14:06:14] <wikibugs>	 (03PS1) 10Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T339890)
[14:07:22] <wikibugs>	 (03CR) 10AOkoth: vrts: apply role and setup hiera values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[14:07:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:07:40] <wikibugs>	 (03PS2) 10Elukey: modules: copy configuration 1.4.1 to 1.5.0 for mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956440
[14:07:42] <wikibugs>	 (03PS2) 10Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T339890)
[14:07:44] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43202/console" [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar)
[14:08:31] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - 10https://gerrit.wikimedia.org/r/956059
[14:09:04] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - 10https://gerrit.wikimedia.org/r/956059 (owner: 10Arturo Borrero Gonzalez)
[14:11:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans) >>! In T344259#9152935, @Eevans wrote: >>>! In T344259#9152542, @Jclark-ctr wrote: >> Replaced optic and cable again  @cmooney @Eevans  >  > Thanks @Jclark-ctr.  U...
[14:11:26] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10colewhite) Linking my comment here for visibility: T345337#9150551
[14:11:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - 10https://gerrit.wikimedia.org/r/956059 (owner: 10Arturo Borrero Gonzalez)
[14:12:08] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm. Keep in mind that GitLab::Projects contains issues, wiki and snippets currently. So if we want to disable more, we have expand the t" [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar)
[14:15:32] <wikibugs>	 (03PS3) 10Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T339890)
[14:15:46] <Lucas_WMDE>	 Func: aaaaaAAAAAHHH!
[14:15:47] <Lucas_WMDE>	 (re https://phabricator.wikimedia.org/T340697#9156521)
[14:15:53] <Lucas_WMDE>	 that’s terrifying
[14:16:32] <Func>	 yeah, it even affects $wgNamespaceProtection
[14:17:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:43] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: add default_project_features parameter to profile [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar)
[14:17:52] * Lucas_WMDE sprays mediawiki-config with holy water
[14:18:31] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491)
[14:19:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[14:19:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491) (owner: 10Alexandros Kosiaris)
[14:19:34] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2005-dev.codfw.wmnet with OS bookworm
[14:23:10] <wikibugs>	 (03CR) 10Milimetric: "@Daniel - just added you since I didn't see this setting/content handler mapping used anywhere, and wondered what you thought about this i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric)
[14:24:20] <wikibugs>	 (03PS8) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027)
[14:25:20] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[14:28:19] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Jhancock.wm) 05Open→03Resolved @Clement_Goubert the defective DIMM has been replaced and booted up. Error hasn't repeated yet.  `The self-heal operation suc...
[14:28:54] <wikibugs>	 (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/953631/43203/" [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[14:30:08] <wikibugs>	 (03PS1) 10Mhorsey: Enable Campaign Events email feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956447 (https://phabricator.wikimedia.org/T345704)
[14:30:32] <wikibugs>	 (03CR) 10Mhorsey: [C: 04-1] "Do not merge until deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956447 (https://phabricator.wikimedia.org/T345704) (owner: 10Mhorsey)
[14:30:41] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826)
[14:30:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1220.eqiad.wmnet with reason: Maintenance
[14:30:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1220.eqiad.wmnet with reason: Maintenance
[14:31:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1220 (T337310)', diff saved to https://phabricator.wikimedia.org/P52416 and previous config saved to /var/cache/conftool/dbconfig/20230911-143102-ladsgroup.json
[14:31:06] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[14:33:11] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Clement_Goubert) Thanks @Jhancock.wm !
[14:33:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10aborrero)
[14:34:56] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43204/console" [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[14:35:58] <wikibugs>	 (03PS9) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027)
[14:37:47] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero)
[14:38:38] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[14:39:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220 (T337310)', diff saved to https://phabricator.wikimedia.org/P52417 and previous config saved to /var/cache/conftool/dbconfig/20230911-143937-ladsgroup.json
[14:39:41] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[14:40:23] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) 05Open→03In progress p:05Triage→03Medium
[14:40:29] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[14:42:23] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491)
[14:42:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491) (owner: 10Alexandros Kosiaris)
[14:48:29] <wikibugs>	 (03PS1) 10Andrea Denisse: netmon: Failover from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/956452 (https://phabricator.wikimedia.org/T344136)
[14:51:27] <wikibugs>	 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10fnegri) I'm running into a similar issue while reimaging `cloudnet2005-dev.codfw.wmnet` to Bookworm.  ` fnegri@cumin1001:~$ sudo cookbook sre.hosts.re...
[14:51:49] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - 10https://gerrit.wikimedia.org/r/956439 (owner: 10Majavah)
[14:52:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add atlas_group (VMs) to RIPE atlas policy [homer/public] - 10https://gerrit.wikimedia.org/r/956435 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi)
[14:52:33] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491)
[14:53:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491) (owner: 10Alexandros Kosiaris)
[14:53:40] <wikibugs>	 (03PS1) 10Ayounsi: Add esams sandbox network prefixes [puppet] - 10https://gerrit.wikimedia.org/r/956454 (https://phabricator.wikimedia.org/T307021)
[14:54:11] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1149.eqiad.wmnet
[14:54:14] <wikibugs>	 (03CR) 10Jelto: vrts: apply role and setup hiera values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[14:54:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220', diff saved to https://phabricator.wikimedia.org/P52418 and previous config saved to /var/cache/conftool/dbconfig/20230911-145443-ladsgroup.json
[14:55:26] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] mw-on-k8s: Raise traffic to 5% [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[14:55:38] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:12] <logmsgbot>	 !log brouberol@cumin1001 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1149.eqiad.wmnet
[14:57:00] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:57:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, we'll need the DNS change too to go with this, like https://gerrit.wikimedia.org/r/c/operations/dns/+/616709 (no smokeping anymore t" [puppet] - 10https://gerrit.wikimedia.org/r/956452 (https://phabricator.wikimedia.org/T344136) (owner: 10Andrea Denisse)
[14:57:28] <wikibugs>	 (03CR) 10Effie Mouzeli: mw-api-ext, mw-web: Raise total replicas to 14 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[14:58:25] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) hey @Jclark-ctr or @Jhancock.wm it would be good for us to know when this reracking can be done in advance, to have the less downtime in...
[14:59:44] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[15:00:31] * volans looking
[15:00:58] <wikibugs>	 (03PS3) 10Clément Goubert: mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780)
[15:01:02] <wikibugs>	 (03CR) 10Clément Goubert: mw-api-ext, mw-web: Raise total replicas to 14 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[15:01:20] <wikibugs>	 (03PS1) 10Andrea Denisse: wikimedia: Failover LibreNMS from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/956455 (https://phabricator.wikimedia.org/T344136)
[15:02:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1031.mgmt.eqiad.wmnet with reboot policy FORCED
[15:03:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, we're not lowering the TTL but I think that's good enough, we can force-refresh as needed" [dns] - 10https://gerrit.wikimedia.org/r/956455 (https://phabricator.wikimedia.org/T344136) (owner: 10Andrea Denisse)
[15:03:31] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[15:04:01] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[15:04:08] <wikibugs>	 (03PS2) 10Andrea Denisse: wikimedia: Failover LibreNMS from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/956455 (https://phabricator.wikimedia.org/T344136)
[15:04:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[15:05:59] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[15:06:13] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) After working with Dell, we determined that the drive is bad and they will be sending a replacement.
[15:06:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr)
[15:06:20] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1037.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:33] <wikibugs>	 10SRE, 10ops-codfw, 10Data-Platform-SRE: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (10Jhancock.wm) p:05Triage→03Medium
[15:07:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1038.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1039.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1040.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1041.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1042.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1043.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1044.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:52] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) p:05Triage→03Medium
[15:09:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220', diff saved to https://phabricator.wikimedia.org/P52419 and previous config saved to /var/cache/conftool/dbconfig/20230911-150950-ladsgroup.json
[15:15:06] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:09] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.60.0" for 595 hosts
[15:18:34] <wikibugs>	 (03Abandoned) 10Arturo Borrero Gonzalez: Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - 10https://gerrit.wikimedia.org/r/956439 (owner: 10Majavah)
[15:19:21] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[15:20:57] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.60.0" for 595 hosts
[15:21:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] ferm: Move more files under the service check conditional [puppet] - 10https://gerrit.wikimedia.org/r/956410 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[15:21:59] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.60.0" completed for 595 hosts
[15:23:17] <logmsgbot>	 !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet2005-dev.codfw.wmnet with OS bookworm
[15:24:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220 (T337310)', diff saved to https://phabricator.wikimedia.org/P52420 and previous config saved to /var/cache/conftool/dbconfig/20230911-152456-ladsgroup.json
[15:24:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[15:25:01] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[15:25:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[15:25:40] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero)
[15:25:53] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1038.mgmt.eqiad.wmnet with reboot policy FORCED
[15:25:56] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1042.mgmt.eqiad.wmnet with reboot policy FORCED
[15:25:59] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1041.mgmt.eqiad.wmnet with reboot policy FORCED
[15:26:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1039.mgmt.eqiad.wmnet with reboot policy FORCED
[15:27:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1044.mgmt.eqiad.wmnet with reboot policy FORCED
[15:28:26] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1037.mgmt.eqiad.wmnet with reboot policy FORCED
[15:28:29] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1043.mgmt.eqiad.wmnet with reboot policy FORCED
[15:30:05] <jouncebot>	 jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1530).
[15:31:35] <damilare>	 hello! Who can I reach out to about adding an Apple verification file to the .well-known directory on donate.wiki?
[15:32:10] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF)
[15:33:52] <wikibugs>	 (03PS1) 10Muehlenhoff: Pass ensure->present to the nftables class if selected [puppet] - 10https://gerrit.wikimedia.org/r/956461
[15:34:05] <RhinosF1>	 damilare: probably best asked in the less noisy #wikimedia-sre
[15:34:21] <RhinosF1>	 Unless fundraising own it
[15:35:06] <damilare>	 thanks RhinosF1, looks like it's more on the side of prod ops. Thanks for the link.
[15:36:04] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[15:36:33] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1040.mgmt.eqiad.wmnet with reboot policy FORCED
[15:37:52] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956461 (owner: 10Muehlenhoff)
[15:40:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1052.mgmt.eqiad.wmnet with reboot policy FORCED
[15:41:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1051.mgmt.eqiad.wmnet with reboot policy FORCED
[15:41:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1053.mgmt.eqiad.wmnet with reboot policy FORCED
[15:41:20] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1049.mgmt.eqiad.wmnet with reboot policy FORCED
[15:41:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1050.mgmt.eqiad.wmnet with reboot policy FORCED
[15:41:25] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/955961/43205/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[15:41:27] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1045.mgmt.eqiad.wmnet with reboot policy FORCED
[15:41:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1046.mgmt.eqiad.wmnet with reboot policy FORCED
[15:41:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1048.mgmt.eqiad.wmnet with reboot policy FORCED
[15:41:46] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1149.eqiad.wmnet
[15:43:08] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[15:43:22] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[15:43:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T343198)', diff saved to https://phabricator.wikimedia.org/P52421 and previous config saved to /var/cache/conftool/dbconfig/20230911-154327-arnaudb.json
[15:43:31] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[15:43:48] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host netmon2002.wikimedia.org with OS bookworm
[15:44:12] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1149.eqiad.wmnet
[15:45:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[15:45:37] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] "Post hoc, but this seems reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar)
[15:47:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt kubernetes1047 - jclark@cumin1001"
[15:48:05] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt kubernetes1047 - jclark@cumin1001"
[15:48:05] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:48:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1047.mgmt.eqiad.wmnet with reboot policy FORCED
[15:48:51] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "excellent!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[15:49:47] <wikibugs>	 (03PS1) 10Func: composer: Install symfony/polyfill-php8x [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022
[15:49:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10MoritzMuehlenhoff) Quick status update; this has seen agreement in the IF SRE meeting, the next step is to sort out which SRE would take care the day-...
[15:51:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:55:09] <wikibugs>	 (03CR) 10Herron: [C: 03+2] profile::prometheus::statsd_exporter: add support for empty mappings [puppet] - 10https://gerrit.wikimedia.org/r/955838 (https://phabricator.wikimedia.org/T345377) (owner: 10Herron)
[15:55:18] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: remove ns-recursorX FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/956463 (https://phabricator.wikimedia.org/T342621)
[15:55:56] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] superset: Move superset metrics to statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: 10Andrea Denisse)
[15:56:10] <wikibugs>	 (03CR) 10Func: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956060 (https://phabricator.wikimedia.org/T346052) (owner: 10Func)
[15:57:17] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "looks good to me for the gitlab firewall config, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/956463 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez)
[15:57:23] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: remove ns-recursorX FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/956463 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez)
[15:59:35] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1150.eqiad.wmnet
[16:00:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[16:00:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[16:01:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:01:33] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1150.eqiad.wmnet
[16:03:03] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
[16:04:13] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1045.mgmt.eqiad.wmnet with reboot policy FORCED
[16:04:16] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1048.mgmt.eqiad.wmnet with reboot policy FORCED
[16:04:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans) >>! In T344259#9156545, @Eevans wrote: >>>! In T344259#9152935, @Eevans wrote: >>>>! In T344259#9152542, @Jclark-ctr wrote: >>> Replaced optic and cable again  @cm...
[16:04:18] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1053.mgmt.eqiad.wmnet with reboot policy FORCED
[16:04:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1046.mgmt.eqiad.wmnet with reboot policy FORCED
[16:04:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1049.mgmt.eqiad.wmnet with reboot policy FORCED
[16:04:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1051.mgmt.eqiad.wmnet with reboot policy FORCED
[16:04:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1052.mgmt.eqiad.wmnet with reboot policy FORCED
[16:04:26] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1050.mgmt.eqiad.wmnet with reboot policy FORCED
[16:04:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede)
[16:05:47] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1151.eqiad.wmnet
[16:06:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:06:10] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
[16:06:55] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1047.mgmt.eqiad.wmnet with reboot policy FORCED
[16:07:56] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1151.eqiad.wmnet
[16:08:35] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2005-dev.codfw.wmnet with OS bookworm
[16:10:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr)
[16:10:31] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye
[16:10:39] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1152.eqiad.wmnet
[16:11:02] <wikibugs>	 (03PS1) 10Tchanders: Enable partial action blocks on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956465 (https://phabricator.wikimedia.org/T339878)
[16:11:02] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye
[16:12:35] <wikibugs>	 (03PS1) 10Ssingh: dnsdist: update dnsdist conf version comment [puppet] - 10https://gerrit.wikimedia.org/r/956466
[16:12:49] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1152.eqiad.wmnet
[16:13:15] <wikibugs>	 (03PS1) 10Tchanders: Enable partial action blocks on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956467 (https://phabricator.wikimedia.org/T332733)
[16:13:33] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43207/console" [puppet] - 10https://gerrit.wikimedia.org/r/956466 (owner: 10Ssingh)
[16:14:16] <wikibugs>	 (03PS2) 10Ssingh: dnsdist: update configuration file for version comment [puppet] - 10https://gerrit.wikimedia.org/r/956466
[16:14:46] <wikibugs>	 (03CR) 10Jforrester: composer: Install symfony/polyfill-php8x (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func)
[16:16:32] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye
[16:16:33] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:17:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1055.mgmt.eqiad.wmnet with reboot policy FORCED
[16:17:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1054.mgmt.eqiad.wmnet with reboot policy FORCED
[16:18:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1056.mgmt.eqiad.wmnet with reboot policy FORCED
[16:19:00] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -...
[16:19:25] <icinga-wm>	 PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:20:08] <wikibugs>	 (03PS1) 10Majavah: icinga: add myself to ACLs [puppet] - 10https://gerrit.wikimedia.org/r/956470
[16:21:33] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:22:44] <wikibugs>	 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10fnegri) Ignore my previous comment, this turned out to be a one-off issue with the reimage cookbook. Restarting the cookbook a second time, it worked...
[16:25:39] <wikibugs>	 (03PS2) 10Func: composer: Install symfony/polyfill-php8x [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022
[16:25:41] <wikibugs>	 (03PS5) 10Func: SiteConfiguration: Make sure the array is a list before appending [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956060 (https://phabricator.wikimedia.org/T346052)
[16:25:43] <wikibugs>	 (03PS1) 10Eevans: install: Use from-scratch partman recipe for restbase1030 [puppet] - 10https://gerrit.wikimedia.org/r/956471 (https://phabricator.wikimedia.org/T331713)
[16:26:07] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] install: Use from-scratch partman recipe for restbase1030 [puppet] - 10https://gerrit.wikimedia.org/r/956471 (https://phabricator.wikimedia.org/T331713) (owner: 10Eevans)
[16:26:34] <wikibugs>	 (03CR) 10Func: composer: Install symfony/polyfill-php8x (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func)
[16:28:30] <wikibugs>	 (03CR) 10Jforrester: [C: 04-1] "No, you can't use composer for this repo to change what code is available to run with, that's what I'm saying. This is now definitely-wron" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func)
[16:28:39] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage
[16:31:18] <logmsgbot>	 !log denisse@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host netmon2002.wikimedia.org with OS bookworm
[16:31:33] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:31:47] <wikibugs>	 (03CR) 10Func: composer: Install symfony/polyfill-php8x (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func)
[16:32:29] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage
[16:32:44] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] install: Use from-scratch partman recipe for restbase1030 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956471 (https://phabricator.wikimedia.org/T331713) (owner: 10Eevans)
[16:32:52] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491)
[16:33:00] <wikibugs>	 (03CR) 10Jforrester: "This is fine for test code, but won't work if someone copies to production code (which doesn't have any composer auto-loaded stuff, it run" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956060 (https://phabricator.wikimedia.org/T346052) (owner: 10Func)
[16:33:50] <wikibugs>	 (03CR) 10Jforrester: [C: 04-1] composer: Install symfony/polyfill-php8x (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func)
[16:34:12] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] install: Use from-scratch partman recipe for restbase1030 [puppet] - 10https://gerrit.wikimedia.org/r/956471 (https://phabricator.wikimedia.org/T331713) (owner: 10Eevans)
[16:41:02] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye
[16:41:16] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye
[16:42:07] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] icinga: add myself to ACLs [puppet] - 10https://gerrit.wikimedia.org/r/956470 (owner: 10Majavah)
[16:42:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: add myself to ACLs [puppet] - 10https://gerrit.wikimedia.org/r/956470 (owner: 10Majavah)
[16:42:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2096.codfw.wmnet with reason: Maintenance
[16:42:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2096.codfw.wmnet with reason: Maintenance
[16:42:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2096 (T337310)', diff saved to https://phabricator.wikimedia.org/P52423 and previous config saved to /var/cache/conftool/dbconfig/20230911-164249-ladsgroup.json
[16:43:57] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] icinga: add myself to ACLs [puppet] - 10https://gerrit.wikimedia.org/r/956470 (owner: 10Majavah)
[16:45:16] <wikibugs>	 (03PS2) 10Ebernhardson: flink-app: Allow declaring zookeeper clusters by name [deployment-charts] - 10https://gerrit.wikimedia.org/r/955033
[16:46:17] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[16:47:17] <wikibugs>	 (03PS9) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960
[16:48:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1054.mgmt.eqiad.wmnet with reboot policy FORCED
[16:48:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1056.mgmt.eqiad.wmnet with reboot policy FORCED
[16:48:27] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1055.mgmt.eqiad.wmnet with reboot policy FORCED
[16:49:21] <wikibugs>	 (03PS1) 10Bking: rdf-streaming-updater-k8s: Add egress rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048)
[16:50:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr)
[16:52:28] <wikibugs>	 (03Abandoned) 10Func: composer: Install symfony/polyfill-php8x [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func)
[16:57:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1027.mgmt.eqiad.wmnet with reboot policy FORCED
[16:58:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096 (T337310)', diff saved to https://phabricator.wikimedia.org/P52424 and previous config saved to /var/cache/conftool/dbconfig/20230911-165802-ladsgroup.json
[16:58:07] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[16:59:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[16:59:43] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1027.mgmt.eqiad.wmnet with reboot policy FORCED
[16:59:45] <wikibugs>	 (03PS2) 10Bking: rdf-streaming-updater-k8s: Add egress rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048)
[16:59:57] <wikibugs>	 (03PS3) 10Bking: rdf-streaming-updater-k8s: Add egress rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1700)
[17:00:05] <jouncebot>	 ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1700).
[17:00:48] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[17:02:56] <wikibugs>	 (03PS6) 10Func: SiteConfiguration: Make sure the array is a list before appending [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956060 (https://phabricator.wikimedia.org/T346052)
[17:04:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:06:41] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1030.eqiad.wmnet with reason: host reimage
[17:09:44] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1030.eqiad.wmnet with reason: host reimage
[17:11:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/956454 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi)
[17:12:26] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[17:13:02] <icinga-wm>	 RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:13:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/956461 (owner: 10Muehlenhoff)
[17:13:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096', diff saved to https://phabricator.wikimedia.org/P52425 and previous config saved to /var/cache/conftool/dbconfig/20230911-171309-ladsgroup.json
[17:15:49] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] SiteConfiguration: Make sure the array is a list before appending [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956060 (https://phabricator.wikimedia.org/T346052) (owner: 10Func)
[17:24:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:24:46] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for VRiley - https://phabricator.wikimedia.org/T346077 (10VRiley-WMF)
[17:25:28] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for VRiley - https://phabricator.wikimedia.org/T346077 (10RobH)
[17:25:45] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to shell/dcops for VRiley - https://phabricator.wikimedia.org/T346077 (10RobH) p:05Triage→03Medium
[17:28:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096', diff saved to https://phabricator.wikimedia.org/P52426 and previous config saved to /var/cache/conftool/dbconfig/20230911-172815-ladsgroup.json
[17:28:57] <wikibugs>	 (03PS1) 10RobH: adding valarie to dc ops shell group [puppet] - 10https://gerrit.wikimedia.org/r/956479 (https://phabricator.wikimedia.org/T346077)
[17:29:11] <wikibugs>	 (03CR) 10Daniel Kinzler: Map Jade content handler to UnknownContentHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric)
[17:31:35] <wikibugs>	 (03CR) 10RobH: [C: 03+2] adding valarie to dc ops shell group [puppet] - 10https://gerrit.wikimedia.org/r/956479 (https://phabricator.wikimedia.org/T346077) (owner: 10RobH)
[17:31:46] <wikibugs>	 (03PS7) 10Ebernhardson: Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032
[17:31:48] <wikibugs>	 (03PS3) 10Ebernhardson: flink-app: Allow declaring zookeeper clusters by name [deployment-charts] - 10https://gerrit.wikimedia.org/r/955033
[17:31:50] <wikibugs>	 (03PS10) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960
[17:31:54] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to shell/dcops for VRiley - https://phabricator.wikimedia.org/T346077 (10Aklapper)
[17:37:03] <wikibugs>	 (03PS3) 10Ladsgroup: Map Jade content handler to UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric)
[17:37:10] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Map Jade content handler to UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric)
[17:38:00] <wikibugs>	 (03Merged) 10jenkins-bot: Map Jade content handler to UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric)
[17:41:01] <wikibugs>	 (03CR) 10Daniel Kinzler: "Uh, hold on... it was renamed to FallbackContentHandler in 1.34 I think?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric)
[17:43:01] <wikibugs>	 (03PS9) 10Herron: profile::mediawiki::common: include prometheus statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377)
[17:43:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:43:09] <wikibugs>	 (03PS10) 10Herron: profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751)
[17:43:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096 (T337310)', diff saved to https://phabricator.wikimedia.org/P52427 and previous config saved to /var/cache/conftool/dbconfig/20230911-174321-ladsgroup.json
[17:43:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[17:43:25] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[17:43:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[17:45:15] <wikibugs>	 (03CR) 10Jbond: "Thanks both for the review on this and sorry it's taken me so long to pick it up.  however would be good to try and get something this week" [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[17:45:24] <wikibugs>	 (03PS4) 10Jbond: rsyslog: update to use pki certificates [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741)
[17:45:26] <wikibugs>	 (03PS1) 10Jbond: rsyslog: switch the endpoints to use the PKI system [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741)
[17:46:22] <wikibugs>	 (03CR) 10Jforrester: "This should probably go in the WikimediaMessages extension, which is what we do for undeployed extensions' messages (see https://gerrit.wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric)
[17:47:12] <wikibugs>	 (03PS1) 10Ladsgroup: Use FallbackContentHandler instead of FakeContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956482
[17:48:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:48:52] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Map Jade content handler to UnknownContentHandler (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric)
[17:50:15] <wikibugs>	 (03PS2) 10Ladsgroup: Use FallbackContentHandler instead of UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956482 (https://phabricator.wikimedia.org/T345874)
[17:51:50] <wikibugs>	 (03CR) 10Jforrester: Map Jade content handler to UnknownContentHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric)
[17:53:08] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host netmon2002.wikimedia.org with OS bullseye
[17:58:00] <wikibugs>	 (03CR) 10Xcollazo: "If it helps lower the burden here, I think we could drop the Spark 3.2 build, here or elsewhere (Gitlab?). No one depends on it, and 3.3.X" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[17:58:36] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001"
[17:59:59] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001"
[18:00:00] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1030.eqiad.wmnet with OS bullseye
[18:00:07] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye completed: - restbase10...
[18:00:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:08:52] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
[18:11:08] <wikibugs>	 (PS1) Ssingh: Release 9.2.1-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154)
[18:11:54] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
[18:13:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance
[18:13:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance
[18:15:00] <wikibugs>	 (CR) Ssingh: "In the logging patch, note that we are using the Warning and Error macros, but 9.2.1 is using SiteThrottledWarning and SiteThrottledError " [debs/trafficserver] - https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: Ssingh)
[18:20:10] <wikibugs>	 (CR) Ebernhardson: rdf-streaming-updater-k8s: Add egress rules to values (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: Bking)
[18:25:44] <wikibugs>	 (CR) Bartosz Dziewoński: "Caused T346080?" [mediawiki-config] - https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: Milimetric)
[18:27:17] <wikibugs>	 (PS3) Bartosz Dziewoński: Use FallbackContentHandler instead of UnknownContentHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/956482 (https://phabricator.wikimedia.org/T345874) (owner: Ladsgroup)
[18:27:31] <wikibugs>	 (CR) STran: [C: +1] Enable partial action blocks on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/956465 (https://phabricator.wikimedia.org/T339878) (owner: Tchanders)
[18:27:37] <wikibugs>	 (CR) Krinkle: [C: -1] "I suggest enabling in beta cluster on its own first, maybe for a week or two, and instruct enwiktionary community to test (and demonstrate" [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[18:27:48] <wikibugs>	 (CR) STran: [C: +1] Enable partial action blocks on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/956467 (https://phabricator.wikimedia.org/T332733) (owner: Tchanders)
[18:28:31] <wikibugs>	 (CR) Ladsgroup: [C: +2] Use FallbackContentHandler instead of UnknownContentHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/956482 (https://phabricator.wikimedia.org/T345874) (owner: Ladsgroup)
[18:29:03] <wikibugs>	 (Merged) jenkins-bot: Use FallbackContentHandler instead of UnknownContentHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/956482 (https://phabricator.wikimedia.org/T345874) (owner: Ladsgroup)
[18:33:18] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon2002.wikimedia.org with OS bullseye
[18:35:33] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:37:51] <wikibugs>	 (PS1) Brennen Bearnes: phabricator deployment: restart php when finalizing deploy [puppet] - https://gerrit.wikimedia.org/r/956486 (https://phabricator.wikimedia.org/T314460)
[18:42:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2131.codfw.wmnet with reason: Maintenance
[18:42:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2131.codfw.wmnet with reason: Maintenance
[18:42:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2131 (T337310)', diff saved to https://phabricator.wikimedia.org/P52428 and previous config saved to /var/cache/conftool/dbconfig/20230911-184231-ladsgroup.json
[18:42:35] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[18:43:19] <wikibugs>	 (PS2) Jbond: puppet: Add new PuppetServer class [software/spicerack] - https://gerrit.wikimedia.org/r/954739
[18:47:02] <wikibugs>	 (CR) CI reject: [V: -1] puppet: Add new PuppetServer class [software/spicerack] - https://gerrit.wikimedia.org/r/954739 (owner: Jbond)
[18:57:38] <wikibugs>	 (CR) Lucas Werkmeister: [C: +1] Add lucaswerkmeister.de to Planet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/948203 (owner: Amire80)
[18:58:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T337310)', diff saved to https://phabricator.wikimedia.org/P52429 and previous config saved to /var/cache/conftool/dbconfig/20230911-185813-ladsgroup.json
[18:58:18] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[19:03:02] <wikibugs>	 (PS5) Srishakatux: Add Akan language [mediawiki-config] - https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765)
[19:09:32] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:09:34] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:09:38] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.234 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[19:09:54] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.236:7000 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:10:12] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.48.236:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.236 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[19:10:16] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.235:7000 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:10:26] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.48.235:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.235 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[19:13:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P52430 and previous config saved to /var/cache/conftool/dbconfig/20230911-191320-ladsgroup.json
[19:14:11] <urandom>	 Got those ^^^
[19:18:20] <wikibugs>	 (PS3) Jbond: puppet: Add new PuppetServer class [software/spicerack] - https://gerrit.wikimedia.org/r/954739
[19:28:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P52431 and previous config saved to /var/cache/conftool/dbconfig/20230911-192826-ladsgroup.json
[19:31:25] <wikibugs>	 (PS1) Jforrester: [mathoid] Switch image to GitLab-published one [deployment-charts] - https://gerrit.wikimedia.org/r/956492 (https://phabricator.wikimedia.org/T344747)
[19:38:49] <wikibugs>	 (CR) Jbond: "thanks updated" [software/spicerack] - https://gerrit.wikimedia.org/r/954739 (owner: Jbond)
[19:43:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T337310)', diff saved to https://phabricator.wikimedia.org/P52432 and previous config saved to /var/cache/conftool/dbconfig/20230911-194332-ladsgroup.json
[19:43:37] <stashbot>	 T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T2000).
[20:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:00:13] <wikibugs>	 (PS1) Andrew Bogott: nova-fullstack: check dns via auth server rather than recursor [puppet] - https://gerrit.wikimedia.org/r/956497 (https://phabricator.wikimedia.org/T346092)
[20:01:25] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmf_sink: don't assume project_name == project_id [puppet] - https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott)
[20:01:35] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova_fullstack_test: don't assume project_name == project_id [puppet] - https://gerrit.wikimedia.org/r/956088 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott)
[20:02:27] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova-fullstack: check dns via auth server rather than recursor [puppet] - https://gerrit.wikimedia.org/r/956497 (https://phabricator.wikimedia.org/T346092) (owner: Andrew Bogott)
[20:03:11] <urbanecm>	 good, nothing to do! :)
[20:05:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[20:09:11] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt titan1001  - jclark@cumin1001"
[20:09:58] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt titan1001  - jclark@cumin1001"
[20:09:58] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:10:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[20:12:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt titan1001  - jclark@cumin1001"
[20:12:39] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[20:13:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt titan1001  - jclark@cumin1001"
[20:13:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:13:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host titan1001
[20:13:32] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host titan1001
[20:13:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host titan1002
[20:14:11] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[20:17:02] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host titan1002
[20:17:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host titan1001
[20:17:12] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host titan1001
[20:18:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host titan1001.mgmt.eqiad.wmnet with reboot policy FORCED
[20:18:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host titan1002.mgmt.eqiad.wmnet with reboot policy FORCED
[20:26:51] <wikibugs>	 (CR) Jeena Huneidi: "Well, it is used by this repository: https://gerrit.wikimedia.org/g/releng/local-charts, but I don't think that repo is used by anyone. We" [deployment-charts] - https://gerrit.wikimedia.org/r/953633 (owner: Jforrester)
[20:37:23] <wikibugs>	 (PS1) Jdlrobson: WIP: Logos for special projects [mediawiki-config] - https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242)
[20:39:15] <wikibugs>	 (CR) Jeena Huneidi: [C: +2] update_version: tox.ini: whitelist_externals -> allowlist_externals [deployment-charts] - https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695) (owner: Hashar)
[20:39:46] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host titan1001.mgmt.eqiad.wmnet with reboot policy FORCED
[20:39:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan1001.eqiad.wmnet']
[20:41:18] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['titan1001.eqiad.wmnet']
[20:41:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan1001.eqiad.wmnet']
[20:41:39] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['titan1001.eqiad.wmnet']
[20:41:47] <wikibugs>	 (PS2) Jdlrobson: WIP: Logos for special projects [mediawiki-config] - https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242)
[20:41:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan1001.eqiad.wmnet']
[20:41:59] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['titan1001.eqiad.wmnet']
[20:43:30] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[20:48:10] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host titan1002.mgmt.eqiad.wmnet with reboot policy FORCED
[20:49:46] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (Jclark-ctr)
[20:51:56] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T2100).
[21:01:08] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.007e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[21:04:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:04:58] <sbassett>	 Hey all - I’ve got one, maybe two patches for today’s security window.
[21:10:02] <wikibugs>	 (PS1) Jclark-ctr: Add titan100[1-2} site.pp [puppet] - https://gerrit.wikimedia.org/r/956506 (https://phabricator.wikimedia.org/T342179)
[21:11:27] <wikibugs>	 (CR) Jclark-ctr: [C: +2] Add titan100[1-2} site.pp [puppet] - https://gerrit.wikimedia.org/r/956506 (https://phabricator.wikimedia.org/T342179) (owner: Jclark-ctr)
[21:19:40] <sbassett>	 !log Deployed security fix for T345693
[21:19:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:10] <wikibugs>	 (CR) RobH: [C: +2] Add titan100[1-2} site.pp [puppet] - https://gerrit.wikimedia.org/r/956506 (https://phabricator.wikimedia.org/T342179) (owner: Jclark-ctr)
[21:24:13] <wikibugs>	 (CR) Cwhite: [C: +2] aptrepo: amend pin to allow grafana 9.4.x [puppet] - https://gerrit.wikimedia.org/r/955014 (https://phabricator.wikimedia.org/T345362) (owner: Cwhite)
[21:24:33] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:32:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host titan1001.eqiad.wmnet with OS bookworm
[21:32:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host titan1001.eqiad.wmnet with OS bookworm
[21:32:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host titan1002.eqiad.wmnet with OS bookworm
[21:32:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host titan1002.eqiad.wmnet with OS bookworm
[21:33:19] <cwhite>	 !log update grafana to 9.4.14 on grafana1002 T345362
[21:33:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:22] <stashbot>	 T345362: DatasourceError grafana alerting error message database is locked - https://phabricator.wikimedia.org/T345362
[21:36:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (Jclark-ctr)
[21:37:48] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (RLazarus) a:joanna_borun Hi @joanna_borun -- does this need Infrastructure Foundations approval?
[21:43:08] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:47:40] <wikibugs>	 (PS1) Eevans: Revert "install: Use from-scratch partman recipe for restbase1030" [puppet] - https://gerrit.wikimedia.org/r/956063
[21:51:44] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:55:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (Eevans) Open→Resolved >>! In T344259#9157174, @Eevans wrote: >>>! In T344259#9156545, @Eevans wrote: >> [ ... ] >> @Jclark-ctr could we try connecting something el...
[22:03:28] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (RLazarus) Hi @Ahoelzl, welcome to the Foundation! SRE here, I'll be able to set you up with production access.  The SSH key you provided is the same one you're already using...
[22:03:33] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (RLazarus)
[22:25:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343198)', diff saved to https://phabricator.wikimedia.org/P52434 and previous config saved to /var/cache/conftool/dbconfig/20230911-222536-arnaudb.json
[22:25:40] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[22:25:44] <wikibugs>	 (CR) RLazarus: [C: +1] envoyproxy: tox.ini: whitelist_externals -> allowlist_externals [puppet] - https://gerrit.wikimedia.org/r/955876 (https://phabricator.wikimedia.org/T345695) (owner: Hashar)
[22:40:43] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P52435 and previous config saved to /var/cache/conftool/dbconfig/20230911-224042-arnaudb.json
[22:42:26] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[22:52:53] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host titan1001.eqiad.wmnet with OS bookworm
[22:52:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host titan1001.eqiad.wmnet with OS bookworm executed with errors:...
[22:53:01] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host titan1002.eqiad.wmnet with OS bookworm
[22:53:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host titan1002.eqiad.wmnet with OS bookworm executed with errors:...
[22:55:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P52436 and previous config saved to /var/cache/conftool/dbconfig/20230911-225548-arnaudb.json
[23:02:26] <wikibugs>	 (CR) Cwhite: [C: +1] netmon: Failover from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/956452 (https://phabricator.wikimedia.org/T344136) (owner: Andrea Denisse)
[23:07:43] <wikibugs>	 (CR) Cwhite: [V: +1] "Tests ok on grafana2001.  Ready for deploy." [puppet] - https://gerrit.wikimedia.org/r/951882 (https://phabricator.wikimedia.org/T288196) (owner: Cwhite)
[23:10:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343198)', diff saved to https://phabricator.wikimedia.org/P52437 and previous config saved to /var/cache/conftool/dbconfig/20230911-231054-arnaudb.json
[23:10:57] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[23:11:00] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:11:10] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[23:11:12] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[23:11:25] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[23:11:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T343198)', diff saved to https://phabricator.wikimedia.org/P52438 and previous config saved to /var/cache/conftool/dbconfig/20230911-231131-arnaudb.json
[23:11:48] <wikibugs>	 (PS1) Dduvall: gitlab: Fix conditional end in gitlab.rb template [puppet] - https://gerrit.wikimedia.org/r/956515
[23:14:05] <wikibugs>	 (CR) Dduvall: "FYI I noticed this bug while trying to test T337570 in devtools." [puppet] - https://gerrit.wikimedia.org/r/956515 (owner: Dduvall)
[23:31:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52439 and previous config saved to /var/cache/conftool/dbconfig/20230911-233135-arnaudb.json
[23:31:39] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:46:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P52440 and previous config saved to /var/cache/conftool/dbconfig/20230911-234641-arnaudb.json