[00:39:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/941907 [00:39:21] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/941907 (owner: 10TrainBranchBot) [00:54:50] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/941907 (owner: 10TrainBranchBot) [01:22:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [01:27:30] (Primary outbound port utilisation over 80% #page) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:06:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:28] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units 
failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:19:00] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:20] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:50] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:58] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:31:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:46:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:51:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:41:28] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - No response from remote host 91.198.174.244 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[03:50:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:55:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:14:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:19:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:40:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:45:16] 
(MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:04:28] 10sre-alert-triage, 10serviceops: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342756 (10Joe) 05Open→03Resolved a:03Joe Yes, we forgot to run `systemctl reset-failed` on mwmaint2002. [05:04:40] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:58] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:14:27] 10SRE, 10Wikimedia-Mailing-lists: Create research-engineering-alerts list - https://phabricator.wikimedia.org/T342833 (10Legoktm) Do you want archives? Some -alerts lists keep archives and others don't. Is the contents of the alerts public or does it need to be private? [05:22:48] !log oblivian@deploy1002 Started scap: (no justification provided) [05:26:37] <_joe_> !log scap is not syncing; just rebuilding the image from scratch to verify the reason for a bug. 
[05:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:47] !log oblivian@deploy1002 Started scap: (no justification provided) [05:31:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:32:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:38:44] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:40:58] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [05:45:29] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [05:50:18] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.487 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:50:26] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:51:02] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:57:37] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T0600) [06:00:05] kormat, marostegui, and Amir1: That opportune time is upon us again. Time for a Primary database switchover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T0600). 
[06:03:39] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [06:32:39] (03PS1) 10Ilias Sarantopoulos: ml-services: add lw request limit for oes-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/942060 (https://phabricator.wikimedia.org/T342789) [06:33:09] (03PS2) 10Ilias Sarantopoulos: ml-services: add lw request limit for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/942060 (https://phabricator.wikimedia.org/T342789) [06:34:08] (03CR) 10Elukey: [C: 03+1] ml-services: add lw request limit for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/942060 (https://phabricator.wikimedia.org/T342789) (owner: 10Ilias Sarantopoulos) [06:34:21] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: add lw request limit for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/942060 (https://phabricator.wikimedia.org/T342789) (owner: 10Ilias Sarantopoulos) [06:35:01] (03Merged) 10jenkins-bot: ml-services: add lw request limit for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/942060 (https://phabricator.wikimedia.org/T342789) (owner: 10Ilias Sarantopoulos) [06:36:39] !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [06:38:08] !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [06:39:00] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[06:40:30] (03PS1) 10Elukey: role::kafka::main: raise num.io.threads to 8 [puppet] - 10https://gerrit.wikimedia.org/r/942061 (https://phabricator.wikimedia.org/T341558) [06:41:30] (03PS1) 10Alexandros Kosiaris: .gitignore: Add .venv [deployment-charts] - 10https://gerrit.wikimedia.org/r/942062 [06:41:53] (03CR) 10Alexandros Kosiaris: [C: 03+1] role::kafka::main: raise num.io.threads to 8 [puppet] - 10https://gerrit.wikimedia.org/r/942061 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [06:44:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] .gitignore: Add .venv [deployment-charts] - 10https://gerrit.wikimedia.org/r/942062 (owner: 10Alexandros Kosiaris) [06:45:02] (03Merged) 10jenkins-bot: .gitignore: Add .venv [deployment-charts] - 10https://gerrit.wikimedia.org/r/942062 (owner: 10Alexandros Kosiaris) [06:46:40] (03PS2) 10Elukey: role::kafka::main: raise num.io.threads to 8 [puppet] - 10https://gerrit.wikimedia.org/r/942061 (https://phabricator.wikimedia.org/T341558) [06:48:00] (03CR) 10Elukey: [C: 03+2] role::kafka::main: raise num.io.threads to 8 [puppet] - 10https://gerrit.wikimedia.org/r/942061 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [06:53:36] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons. 
[06:56:24] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:56:50] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:57:46] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/941961 (https://phabricator.wikimedia.org/T342016) (owner: 10EoghanGaffney) [07:00:05] Amir1, apergos, and jnuche: (Dis)respected human, time to deploy UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T0700). Please do the needful. [07:00:29] morning! [07:00:43] we have no patches scheduled for deployment but there is one person signed up for the training [07:01:27] as such, if a self-deployer has a patch to sneak in at the last minute, and is willing to go nice and slow so I can explain all the steps, and what's going on under the hood and such, now's the time to claim that spot! [07:35:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:38:51] hello jennifer_ebe [07:39:22] (please ignore these pings) [07:39:25] (10:00:05 AM) jouncebot: Amir1, apergos, and jnuche: (Dis)respected human, time to deploy UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T0700). Please do the needful. 
[07:40:04] !log reboot lsw1-a1-codfw (test device) [07:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:41:02] (03PS1) 10Giuseppe Lavagetto: mw-misc: configure ingress for noc.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/942064 (https://phabricator.wikimedia.org/T341859) [07:43:55] (03CR) 10JMeybohm: [C: 03+1] mw-misc: configure ingress for noc.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/942064 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [07:46:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-misc: configure ingress for noc.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/942064 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [07:47:30] (03Merged) 10jenkins-bot: mw-misc: configure ingress for noc.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/942064 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [07:49:16] jouncebot: now [07:49:16] For the next 0 hour(s) and 10 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T0700) [07:49:39] apergos: is backport still ongoing? [07:50:04] give me ten more minutes to talk talk talk and then I'm out. [07:50:12] joe: [07:50:34] ack :) [07:51:02] apergos: are you deploying with scap right now? [07:51:06] nope [07:51:14] no patches to go, so... 
:-) [07:53:54] oh ok [07:54:03] then I'll just deploy noc-to-k8s [07:54:05] :) [07:54:14] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [07:54:24] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [07:54:34] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [07:54:39] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [07:56:11] all you [08:00:04] jnuche and dancy: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T0800) [08:00:29] morning [08:01:04] joe: are you finished with those deploys? [08:01:30] jnuche: only if you promise to review a patch to releng/release later :D [08:01:39] but yes I was done [08:01:53] hehehe, sure, just ping me on that patch :) [08:02:06] thx, deploying the train in a couple mins [08:05:16] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942317 (https://phabricator.wikimedia.org/T340247) [08:05:18] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942317 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [08:06:01] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942317 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [08:12:22] (03CR) 10EoghanGaffney: [C: 03+2] releases: Change owner on /srv/patches after rsync [puppet] - 10https://gerrit.wikimedia.org/r/941961 (https://phabricator.wikimedia.org/T342016) (owner: 10EoghanGaffney) [08:15:59] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.19 refs T340247 [08:16:03] T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247 [08:20:10] (03PS16) 10Slyngshede: 
C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [08:24:17] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42709/console" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:31:06] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42710/console" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:34:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons. [08:39:23] (03CR) 10Slyngshede: [V: 03+1] C:bigtop::hadoop move net-topology.py to files. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:42:49] (03PS1) 10Urbanecm: mailmap: Add mapping for my addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942321 [08:42:53] !log begin restarting lvs1019 (T335835) [08:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:36] (03PS2) 10Vgutierrez: varnish: add requestctl to X-analytics for static actions too [puppet] - 10https://gerrit.wikimedia.org/r/941448 (https://phabricator.wikimedia.org/T342577) (owner: 10Giuseppe Lavagetto) [08:43:38] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (10dcausse) There was a stale `/srv/query_service/aliases.map` file with some content in it (that I copied to `/root/aliases.map.T342762`) which I believe was confusing nginx causing it to r... 
[08:44:17] (03CR) 10Vgutierrez: "as discussed on IRC, s/-/_/g on the rule names to follow the de facto standard used in requestctl" [puppet] - 10https://gerrit.wikimedia.org/r/941448 (https://phabricator.wikimedia.org/T342577) (owner: 10Giuseppe Lavagetto) [08:45:39] (03PS2) 10Urbanecm: mailmap: Add mapping for urbanecm's addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942321 [08:47:08] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [08:47:28] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [08:49:58] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [08:52:58] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:53:34] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:53:34] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:53:34] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:53:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:58:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlserve@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:07:50] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1019.eqiad.wmnet [09:11:05] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1019.eqiad.wmnet [09:12:12] RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [09:12:15] !log done restarting lvs1019 (T335835) [09:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:30] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:14:21] 10SRE: Consider making a variant of the fatalmonitor CLI tool that ignores appserver timeouts - https://phabricator.wikimedia.org/T213777 (10LSobanski) 05Open→03Declined Since the linked comment is: "(fatalmonitor) No longer exists - see logspam-watch instead" I would consider this as a won't fix. 
[09:17:16] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 81 connections established with conf1007.eqiad.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [09:19:23] (03PS1) 10EoghanGaffney: releases: Change ownership of /srv/patches after chown [puppet] - 10https://gerrit.wikimedia.org/r/942372 (https://phabricator.wikimedia.org/T342016) [09:20:27] !log Run `mwscript extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php --wiki=frwiki --page="Sensibilité électromagnétique" --force` to debug T342488 [09:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:32] T342488: Add-link suggested edits doesn't show Captcha to user and are blocked - https://phabricator.wikimedia.org/T342488 [09:21:49] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42711/console" [puppet] - 10https://gerrit.wikimedia.org/r/942372 (https://phabricator.wikimedia.org/T342016) (owner: 10EoghanGaffney) [09:28:01] (03PS3) 10Jbond: (WIP) puppetdb-microservice: update puppetdb micro service so it streams data [puppet] - 10https://gerrit.wikimedia.org/r/940403 (https://phabricator.wikimedia.org/T342458) [09:30:10] (03CR) 10Jelto: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/942372 (https://phabricator.wikimedia.org/T342016) (owner: 10EoghanGaffney) [09:34:56] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] releases: Change ownership of /srv/patches after chown [puppet] - 10https://gerrit.wikimedia.org/r/942372 (https://phabricator.wikimedia.org/T342016) (owner: 10EoghanGaffney) [09:38:54] !log begin restarting lvs3007 (T335835) [09:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:09] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3007.esams.wmnet [09:40:28] (03PS1) 10Jcrespo: dbbackups: Upgrade dbprov1004 to MariaDB 10.6 [puppet] - 
10https://gerrit.wikimedia.org/r/942376 (https://phabricator.wikimedia.org/T334650) [09:41:02] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:42:48] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3007.esams.wmnet [09:44:10] !log done restarting lvs3007 (T335835) [09:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:01] (03PS2) 10Jcrespo: dbbackups: Upgrade dbprov1004,2004 to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/942376 (https://phabricator.wikimedia.org/T334650) [09:52:43] (03PS3) 10Jcrespo: dbbackups: Upgrade dbprov1004,2004 to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/942376 (https://phabricator.wikimedia.org/T334650) [09:54:15] !log begin restarting lvs3005 (T335835) [09:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:48] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:56:17] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. 
[09:58:08] PROBLEM - PyBal connections to etcd on lvs3005 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [09:58:50] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:59:14] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [09:59:18] PROBLEM - pybal on lvs3005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1000). [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1000) [10:02:42] (03PS1) 10Jcrespo: dbbackups: Reorganize backups so dbprov1004,2006 creates 10.6 ones [puppet] - 10https://gerrit.wikimedia.org/r/942380 (https://phabricator.wikimedia.org/T334650) [10:05:36] (03PS2) 10Jcrespo: dbbackups: Reorganize backups so dbprov1004,2006 create 10.6 ones [puppet] - 10https://gerrit.wikimedia.org/r/942380 (https://phabricator.wikimedia.org/T334650) [10:11:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] apache: Redirect wikifunctions.org to www.wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/941971 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [10:12:00] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3005.esams.wmnet [10:13:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [10:14:10] (03CR) 10Alexandros Kosiaris: [C: 03+2] apache: Actually enable view_urls on 
wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/941979 (https://phabricator.wikimedia.org/T342794) (owner: 10Jforrester) [10:14:29] (03PS1) 10Hnowlan: api-gateway: sample logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/942382 [10:14:51] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3005.esams.wmnet [10:15:14] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 18, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:15:36] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:15:40] RECOVERY - pybal on lvs3005 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:19:11] (03CR) 10Jcrespo: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/942376/42713/" [puppet] - 10https://gerrit.wikimedia.org/r/942376 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [10:20:00] RECOVERY - PyBal connections to etcd on lvs3005 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:24:50] !log kevinbazira@deploy1002 Started deploy [ores/deploy@c30920f]: T342118 [10:24:54] T342118: Add deprecation message for ORES UI - https://phabricator.wikimedia.org/T342118 [10:25:23] (03PS1) 10Majavah: varnish: rewrite m.wikifunctions.org correctly [puppet] - 10https://gerrit.wikimedia.org/r/942383 (https://phabricator.wikimedia.org/T342846) [10:26:55] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reorganize backups so dbprov1004,2006 create 10.6 ones [puppet] - 10https://gerrit.wikimedia.org/r/942380 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [10:30:28] (03CR) 10Jbond: [C: 03+1] sre.dns.netbox: use cumin alias for Netbox hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/941757 (owner: 10Volans) [10:31:37] (03CR) 10Jbond: [C: 03+1] "lgtm" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/941758 (owner: 10Volans) [10:31:54] (03CR) 10Alexandros Kosiaris: api-gateway: sample logs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/942382 (owner: 10Hnowlan) [10:32:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:33:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/941759 (https://phabricator.wikimedia.org/T297516) (owner: 10Volans) [10:33:55] !log kevinbazira@deploy1002 Finished deploy [ores/deploy@c30920f]: T342118 (duration: 09m 04s) [10:33:59] T342118: Add deprecation message for ORES UI - https://phabricator.wikimedia.org/T342118 [10:34:45] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Ladsgroup) Random idea: The expiry header can be set either with sampling (let's say 1/10th) OR if the expiry is near in the futu... 
[10:34:47] !log done restarting lvs3005 (T335835) [10:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:59] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/941760 (owner: 10Volans) [10:35:07] !log begin restarting lvs3006 (T335835) [10:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:12] !log purge edge caches for "https://wikifunctions.org/" [10:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:37:16] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:39:10] PROBLEM - pybal on lvs3006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:39:36] PROBLEM - PyBal backends health check on lvs3006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [10:40:10] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:40:34] PROBLEM - PyBal connections to etcd on lvs3006 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [10:41:27] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] eqiad1: cloudnet: enable cloud-private subnet [puppet] - 
10https://gerrit.wikimedia.org/r/941445 (https://phabricator.wikimedia.org/T342619) (owner: 10Arturo Borrero Gonzalez) [10:50:06] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudnet1006 [10:50:22] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudnet1006 [10:50:25] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudnet1005 [10:50:28] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudnet1005 [10:50:54] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [10:52:49] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet1005/1006 - aborrero@cumin1001" [10:53:35] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet1005/1006 - aborrero@cumin1001" [10:53:35] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:54:57] (03PS1) 10Jbond: ssh::known_hosts: add new known_hosts functions [puppet] - 10https://gerrit.wikimedia.org/r/942389 [10:55:56] (03CR) 10Jbond: DO NOT MERGE: Remove hostname from ssh known_hosts aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941543 (owner: 10JHathaway) [10:56:41] 10SRE, 10Infrastructure-Foundations, 10netops: Add per-output queue graphing for Juniper network devices in LibreNMS - https://phabricator.wikimedia.org/T326322 (10ayounsi) I tested gNMI on lsw1-a1-codfw as it's not in production yet. After upgrading it to Junos 22.2. I configured gnmic with a basic: `lang=... 
[10:56:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42714/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942389 (owner: 10Jbond) [10:57:04] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:58:44] (03PS2) 10Jbond: ssh::known_hosts: add new known_hosts functions [puppet] - 10https://gerrit.wikimedia.org/r/942389 [11:00:21] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3006.esams.wmnet [11:00:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42715/console" [puppet] - 10https://gerrit.wikimedia.org/r/942389 (owner: 10Jbond) [11:02:32] (03PS2) 10Hnowlan: api-gateway: sample logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/942382 [11:03:09] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3006.esams.wmnet [11:03:14] (03CR) 10Jbond: [V: 03+1] ssh::known_hosts: add new known_hosts functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942389 (owner: 10Jbond) [11:03:16] PROBLEM - Host lvs3006 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:30] RECOVERY - Host lvs3006 is UP: PING OK - Packet loss = 0%, RTA = 81.04 ms [11:03:52] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 18, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:04:18] RECOVERY - pybal on lvs3006 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [11:04:46] RECOVERY - PyBal backends health check on lvs3006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:07:56] RECOVERY - PyBal connections to etcd on lvs3006 is OK: OK: 4 
connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [11:08:23] (03PS1) 10Ladsgroup: rdbms: Avoid making wasteful memcached calls in CP [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/942000 (https://phabricator.wikimedia.org/T314434) [11:08:36] (03PS1) 10Ladsgroup: rdbms: Avoid making wasteful memcached calls in CP [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/942001 (https://phabricator.wikimedia.org/T314434) [11:08:49] (03PS1) 10Ladsgroup: CentralAuthUser: Don't load user information unless needed [extensions/CentralAuth] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/942002 [11:09:01] (03PS1) 10Ladsgroup: CentralAuthUser: Don't load user information unless needed [extensions/CentralAuth] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/942003 [11:09:21] jouncebot: nowandnext [11:09:21] No deployments scheduled for the next 1 hour(s) and 50 minute(s) [11:09:21] In 1 hour(s) and 50 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1300) [11:09:21] In 1 hour(s) and 50 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1300) [11:09:26] cool [11:09:39] (03CR) 10Ladsgroup: [C: 03+2] CentralAuthUser: Don't load user information unless needed [extensions/CentralAuth] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/942002 (owner: 10Ladsgroup) [11:10:51] (03PS1) 10Arturo Borrero Gonzalez: wmcs: cloud_private_subnet: fail if DNS records cannot be resolved [puppet] - 10https://gerrit.wikimedia.org/r/942394 [11:12:33] !log done restarting lvs3006 (T335835) [11:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:48] (03CR) 10CI reject: [V: 04-1] CentralAuthUser: Don't load user information unless needed [extensions/CentralAuth] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/942002 (owner: 10Ladsgroup) 
[11:16:28] (03PS3) 10Jbond: ssh::known_hosts: add new known_hosts functions [puppet] - 10https://gerrit.wikimedia.org/r/942389 [11:16:45] (03CR) 10Ladsgroup: [C: 03+2] "." [extensions/CentralAuth] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/942002 (owner: 10Ladsgroup) [11:16:47] hmmm [11:18:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42716/console" [puppet] - 10https://gerrit.wikimedia.org/r/942389 (owner: 10Jbond) [11:20:57] (03CR) 10Jbond: [V: 03+1] "from pcc this is at least working if it becomes to annoying for people but would like to add some test before merging" [puppet] - 10https://gerrit.wikimedia.org/r/942389 (owner: 10Jbond) [11:21:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:21:29] (03Merged) 10jenkins-bot: CentralAuthUser: Don't load user information unless needed [extensions/CentralAuth] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/942002 (owner: 10Ladsgroup) [11:21:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:21:48] (03PS2) 10Arturo Borrero Gonzalez: wmcs: cloud_private_subnet: fail if DNS records cannot be resolved [puppet] - 10https://gerrit.wikimedia.org/r/942394 [11:22:32] (03CR) 10Ladsgroup: [C: 03+2] CentralAuthUser: Don't load user information unless needed [extensions/CentralAuth] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/942003 (owner: 10Ladsgroup) [11:22:52] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:942002|CentralAuthUser: Don't load user information unless needed]] [11:24:17] (03CR) 10Jbond: "LGTM but lets be a bit stricter on the type and nice find" [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [11:24:22] !log 
ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:942002|CentralAuthUser: Don't load user information unless needed]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [11:24:27] (03CR) 10CI reject: [V: 04-1] wmcs: cloud_private_subnet: fail if DNS records cannot be resolved [puppet] - 10https://gerrit.wikimedia.org/r/942394 (owner: 10Arturo Borrero Gonzalez) [11:25:30] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/941912 [11:27:58] (03Merged) 10jenkins-bot: CentralAuthUser: Don't load user information unless needed [extensions/CentralAuth] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/942003 (owner: 10Ladsgroup) [11:30:39] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:942002|CentralAuthUser: Don't load user information unless needed]] (duration: 07m 47s) [11:31:42] !log ladsgroup@deploy1002 backport Cancelled [11:35:28] (03Abandoned) 10Ladsgroup: rdbms: Avoid making wasteful memcached calls in CP [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/942001 (https://phabricator.wikimedia.org/T314434) (owner: 10Ladsgroup) [11:35:32] (03CR) 10Ladsgroup: [C: 03+2] rdbms: Avoid making wasteful memcached calls in CP [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/942000 (https://phabricator.wikimedia.org/T314434) (owner: 10Ladsgroup) [11:37:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. 
[11:37:46] jouncebot: nowandnext [11:37:46] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [11:37:46] In 1 hour(s) and 22 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1300) [11:37:46] In 1 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1300) [11:37:57] I'll slip out the logo for Wikifunctions now. [11:38:09] (03CR) 10Jforrester: [C: 03+2] mailmap: Add mapping for urbanecm's addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942321 (owner: 10Urbanecm) [11:38:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942030 (owner: 10Jforrester) [11:38:48] (03Merged) 10jenkins-bot: mailmap: Add mapping for urbanecm's addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942321 (owner: 10Urbanecm) [11:39:22] (03Merged) 10jenkins-bot: Wikifunctions: Add logo, wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942030 (owner: 10Jforrester) [11:39:50] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:942030|Wikifunctions: Add logo, wordmark]] [11:40:12] (03PS1) 10Fabfur: Bump target distribution to Bookworm [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/942414 [11:40:41] ty James_F [11:41:23] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:942030|Wikifunctions: Add logo, wordmark]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [11:42:54] !log aborrero@cumin1001 START - Cookbook sre.dns.wipe-cache cloudnet1005.private.eqiad.wikimedia.cloud on all recursors [11:42:57] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudnet1005.private.eqiad.wikimedia.cloud 
on all recursors [11:43:32] urbanecm: Of course! :-) [11:43:38] urbanecm: Have you made the same commit in core? [11:43:50] yes, and it's merged already. [11:43:52] :) [11:44:14] Ace. [11:44:55] (03CR) 10Jforrester: [C: 03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/942383 (https://phabricator.wikimedia.org/T342846) (owner: 10Majavah) [11:47:04] (03CR) 10Urbanecm: GrowthExperiments: enable AddLink task frontend in 10th round of wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940347 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno) [11:47:10] (03PS2) 10Urbanecm: GrowthExperiments: enable AddLink task frontend in 10th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940347 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno) [11:47:22] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:47:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:48:03] (03PS3) 10Jbond: wmcs: cloud_private_subnet: fail if DNS records cannot be resolved [puppet] - 10https://gerrit.wikimedia.org/r/942394 (owner: 10Arturo Borrero Gonzalez) [11:48:20] (03PS1) 10Jcrespo: mariadb: Disable notifications for db2097, db2141 [puppet] - 10https://gerrit.wikimedia.org/r/942415 (https://phabricator.wikimedia.org/T334650) [11:48:22] (03PS1) 10Jcrespo: mariadb: Move s6 from db2141 to db2097 and drop s1 & add x1 [puppet] - 10https://gerrit.wikimedia.org/r/942416 (https://phabricator.wikimedia.org/T334650) [11:48:25] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:942030|Wikifunctions: Add logo, wordmark]] (duration: 08m 35s) [11:48:26] (03CR) 10Fabfur: "Lintian is quite happy with the binary package (at least on build2001):" [software/fifo-log-demux] - 
10https://gerrit.wikimedia.org/r/942414 (owner: 10Fabfur) [11:49:11] (03CR) 10Jcrespo: [C: 03+2] mariadb: Disable notifications for db2097, db2141 [puppet] - 10https://gerrit.wikimedia.org/r/942415 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [11:49:20] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet2005-dev/2006-dev - aborrero@cumin1001" [11:50:02] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet2005-dev/2006-dev - aborrero@cumin1001" [11:50:02] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:50:29] (03Merged) 10jenkins-bot: rdbms: Avoid making wasteful memcached calls in CP [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/942000 (https://phabricator.wikimedia.org/T314434) (owner: 10Ladsgroup) [11:50:41] !log aborrero@cumin1001 START - Cookbook sre.dns.wipe-cache cloudnet2006-dev.private.codfw.wikimedia.cloud on all recursors [11:50:44] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudnet2006-dev.private.codfw.wikimedia.cloud on all recursors [11:50:52] !log aborrero@cumin1001 START - Cookbook sre.dns.wipe-cache cloudnet2005-dev.private.codfw.wikimedia.cloud on all recursors [11:50:54] (03CR) 10CI reject: [V: 04-1] wmcs: cloud_private_subnet: fail if DNS records cannot be resolved [puppet] - 10https://gerrit.wikimedia.org/r/942394 (owner: 10Arturo Borrero Gonzalez) [11:50:55] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudnet2005-dev.private.codfw.wikimedia.cloud on all recursors [11:51:10] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:942000|rdbms: Avoid making wasteful memcached calls in CP (T314434)]] [11:51:14] T314434: Avoid ChronologyProtector queries on majory of pageviews that have no recent 
positions - https://phabricator.wikimedia.org/T314434 [11:52:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:52:40] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:942000|rdbms: Avoid making wasteful memcached calls in CP (T314434)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [11:54:47] (03PS1) 10Jforrester: Wikifunctions: Also add square logo for Vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942417 [11:55:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942417 (owner: 10Jforrester) [11:55:48] (03Merged) 10jenkins-bot: Wikifunctions: Also add square logo for Vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942417 (owner: 10Jforrester) [12:00:05] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:942000|rdbms: Avoid making wasteful memcached calls in CP (T314434)]] (duration: 08m 54s) [12:00:17] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:942417|Wikifunctions: Also add square logo for Vector-2022]] [12:00:22] T314434: Avoid ChronologyProtector queries on majory of pageviews that have no recent positions - https://phabricator.wikimedia.org/T314434 [12:01:43] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:942417|Wikifunctions: Also add square logo for Vector-2022]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [12:06:35] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:06:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:06:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:07:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:07:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T342617)', diff saved to https://phabricator.wikimedia.org/P49747 and previous config saved to /var/cache/conftool/dbconfig/20230727-120710-ladsgroup.json [12:07:14] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:07:23] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:942417|Wikifunctions: Also add square logo for Vector-2022]] (duration: 07m 05s) [12:08:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] api-gateway: sample logs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/942382 (owner: 10Hnowlan) [12:08:56] !log systemctl stop mariadb@s1 @ db2097 [12:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:00] James_F: is there one of those 'post-creation work' tasks for wikifunctions? should there be one? [12:13:24] taavi: There’s a column. [12:13:48] Just add a comment on the main task? [12:14:50] 10SRE, 10Infrastructure-Foundations, 10netops: Add per-output queue graphing for Juniper network devices in LibreNMS - https://phabricator.wikimedia.org/T326322 (10cmooney) Nice work! Bit of work to get it productionized for sure but great to see it working! [12:15:00] not sure. 
mostly I'm thinking of stuff like restbase, pywikibot, etc that those tasks usually have [12:15:29] (T336115 for example is what I'm talking about) [12:15:30] T336115: Post-creation work for btmwiktionary - https://phabricator.wikimedia.org/T336115 [12:18:16] (03CR) 10Jbond: [C: 04-1] "-1: i have left specific comments inline but its unclear what the end goal is here so difficult to advice on different ways forward." [puppet] - 10https://gerrit.wikimedia.org/r/941476 (owner: 10Krinkle) [12:18:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [12:19:41] (03PS3) 10Slyngshede: D:apereo_cas::service support FLAT profiles. [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) [12:19:49] (03CR) 10Slyngshede: D:apereo_cas::service support FLAT profiles. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [12:20:06] (03PS4) 10Arturo Borrero Gonzalez: wmcs: cloud_private_subnet: fail if DNS records cannot be resolved [puppet] - 10https://gerrit.wikimedia.org/r/942394 [12:24:47] (03PS5) 10Arturo Borrero Gonzalez: wmcs: cloud_private_subnet: fail if DNS records cannot be resolved [puppet] - 10https://gerrit.wikimedia.org/r/942394 [12:31:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice! 
A few comments inline, but looks already pretty good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [12:31:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T342617)', diff saved to https://phabricator.wikimedia.org/P49748 and previous config saved to /var/cache/conftool/dbconfig/20230727-123153-ladsgroup.json [12:31:59] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:36:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1, but I 'd prefer if traffic reviews and merges." [puppet] - 10https://gerrit.wikimedia.org/r/942383 (https://phabricator.wikimedia.org/T342846) (owner: 10Majavah) [12:36:45] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [12:37:05] (03PS2) 10Jcrespo: mariadb: Move s6 from db2141 to db2097 and drop s1 & add x1 [puppet] - 10https://gerrit.wikimedia.org/r/942416 (https://phabricator.wikimedia.org/T334650) [12:37:38] 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10RobH) [12:37:50] (03PS1) 10Filippo Giunchedi: hieradata: finish cadvisor rollout on k8s-aux [puppet] - 10https://gerrit.wikimedia.org/r/942420 (https://phabricator.wikimedia.org/T108027) [12:38:11] 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10RobH) [12:38:42] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42718/console" [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [12:40:54] (03PS3) 10Jcrespo: mariadb: Move s6 from db2141 to db2097 and drop s1 & add x1 [puppet] - 
10https://gerrit.wikimedia.org/r/942416 (https://phabricator.wikimedia.org/T334650) [12:40:57] (03PS1) 10Alexandros Kosiaris: Run sextant update charts/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/942421 [12:41:03] (03CR) 10CI reject: [V: 04-1] mariadb: Move s6 from db2141 to db2097 and drop s1 & add x1 [puppet] - 10https://gerrit.wikimedia.org/r/942416 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [12:41:06] (03PS4) 10Jcrespo: mariadb: Move s6 from db2141 to db2097 and drop s1 & add x1 [puppet] - 10https://gerrit.wikimedia.org/r/942416 (https://phabricator.wikimedia.org/T334650) [12:41:49] (03CR) 10Alexandros Kosiaris: "Requesting input. Is this what we should expect when running sextant update? Are all these removals OK ?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/942421 (owner: 10Alexandros Kosiaris) [12:43:09] !log begin restarting lvs6003 (T335835) [12:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:08] (03CR) 10Samtar: "ack'd as OK on Slack, ready for deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942419 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar) [12:44:22] (03CR) 10Jcrespo: [C: 03+2] mariadb: Move s6 from db2141 to db2097 and drop s1 & add x1 [puppet] - 10https://gerrit.wikimedia.org/r/942416 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [12:44:38] (03CR) 10Jcrespo: [C: 03+2] make dumpsdata1006 the xmlfallback host [puppet] - 10https://gerrit.wikimedia.org/r/908995 (https://phabricator.wikimedia.org/T325232) (owner: 10Hokwelum) [12:44:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: finish cadvisor rollout on k8s-aux [puppet] - 10https://gerrit.wikimedia.org/r/942420 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [12:45:16] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs6003.drmrs.wmnet [12:45:18] jouncebot: nowandnext [12:45:18] No deployments scheduled for the 
next 0 hour(s) and 14 minute(s) [12:45:18] In 0 hour(s) and 14 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1300) [12:45:18] In 0 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1300) [12:45:49] (03CR) 10Jcrespo: "Sorry, I accidentally misclicked +2 here. Not intended." [puppet] - 10https://gerrit.wikimedia.org/r/908995 (https://phabricator.wikimedia.org/T325232) (owner: 10Hokwelum) [12:46:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942419 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar) [12:46:41] (03Merged) 10jenkins-bot: IS-labs: Enable edit recovery on en.wikipedia.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942419 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar) [12:47:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P49749 and previous config saved to /var/cache/conftool/dbconfig/20230727-124700-ladsgroup.json [12:47:03] (beta only, done) [12:48:04] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:48:15] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6003.drmrs.wmnet [12:48:24] PROBLEM - Host lvs6003 is DOWN: PING CRITICAL - Packet loss = 100% [12:48:28] RECOVERY - Host lvs6003 is UP: PING OK - Packet loss = 0%, RTA = 86.31 ms [12:48:40] PROBLEM - pybal on lvs6003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [12:49:20] ^^ it's me rebooting lvs6003 (the cookbook sometimes doesn't really downtime the host) [12:49:32] RECOVERY - BGP status
on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:49:50] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:50:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:50:06] RECOVERY - pybal on lvs6003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [12:50:48] (03PS1) 10EoghanGaffney: releases: Add final chown to /srv/patches quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/942422 (https://phabricator.wikimedia.org/T342016) [12:53:57] !log done restarting lvs6003 (T335835) [12:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:34] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/942422 (https://phabricator.wikimedia.org/T342016) (owner: 10EoghanGaffney) [12:54:49] !log begin restarting lvs6001 (T335835) [12:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:56:00] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42719/console" [puppet] - 
10https://gerrit.wikimedia.org/r/942422 (https://phabricator.wikimedia.org/T342016) (owner: 10EoghanGaffney) [12:56:19] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] releases: Add final chown to /srv/patches quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/942422 (https://phabricator.wikimedia.org/T342016) (owner: 10EoghanGaffney) [12:59:41] PROBLEM - PyBal backends health check on lvs6001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [12:59:41] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:59:53] PROBLEM - PyBal connections to etcd on lvs6001 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1300) [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1300). [13:00:06] duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:25] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:01:01] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:03] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:01:37] * TheresNoTime can deploy [13:02:05] (03PS2) 10Samtar: Re-enable PC writes for parsoid endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941946 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [13:02:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P49750 and previous config saved to /var/cache/conftool/dbconfig/20230727-130206-ladsgroup.json [13:03:13] TheresNoTime: i'm here [13:03:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941946 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [13:04:16] (03Merged) 10jenkins-bot: Re-enable PC writes for parsoid endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941946 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [13:04:33] PROBLEM - pybal on lvs6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:04:33] !log samtar@deploy1002 Started scap: Backport for [[gerrit:941946|Re-enable PC writes for parsoid endpoints (T339867)]] [13:04:36] T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867 [13:04:45] TheresNoTime: the patch isn't testable on debug hosts, the code is triggered when restbase talks to rest endpoints on the parsoid 
cluster. I'll keep an eye on grafana, but this is just going back to the config as it was a month ago. [13:04:58] duesen: ack, will sync straight away [13:05:43] ok, cool. [13:05:57] !log samtar@deploy1002 samtar and daniel: Backport for [[gerrit:941946|Re-enable PC writes for parsoid endpoints (T339867)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:06:04] (sync) [13:11:35] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:941946|Re-enable PC writes for parsoid endpoints (T339867)]] (duration: 07m 02s) [13:11:39] T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867 [13:11:47] duesen: live :) [13:12:00] TheresNoTime: i see the cache writes ramping up [13:12:03] (03CR) 10Jbond: puppetserver: make notifying configurable (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:12:04] looking good [13:12:12] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs6001.drmrs.wmnet [13:12:15] * TheresNoTime will be around for 30mins, if there are any other patches needing deployment? 
[13:13:52] (03CR) 10Jbond: puppetserver: Add a file to track when a service restart or reload is (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:15:11] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6001.drmrs.wmnet [13:15:44] !log done restarting lvs6001 (T335835) [13:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:50] (03CR) 10Jbond: [V: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:15:53] RECOVERY - pybal on lvs6001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:16:01] (03PS8) 10Jbond: puppetserver: make notifying configurable [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) [13:16:03] (03PS4) 10Jbond: puppetserver: Add a file to track when a service restart or reload is [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) [13:16:05] (03PS4) 10Jbond: motd: Add motd indicating services which need restarting [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) [13:16:07] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:16:11] RECOVERY - PyBal backends health check on lvs6001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:16:26] (03CR) 10Hnowlan: [C: 03+2] changeprop: bump node-rdkafka, use buster base (prod version) [deployment-charts] - 10https://gerrit.wikimedia.org/r/941780 (https://phabricator.wikimedia.org/T341140) (owner: 10Elukey) [13:17:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T342617)', diff saved to 
https://phabricator.wikimedia.org/P49751 and previous config saved to /var/cache/conftool/dbconfig/20230727-131712-ladsgroup.json [13:17:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:17:16] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:17:21] (03Merged) 10jenkins-bot: changeprop: bump node-rdkafka, use buster base (prod version) [deployment-charts] - 10https://gerrit.wikimedia.org/r/941780 (https://phabricator.wikimedia.org/T341140) (owner: 10Elukey) [13:17:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:17:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T342617)', diff saved to https://phabricator.wikimedia.org/P49752 and previous config saved to /var/cache/conftool/dbconfig/20230727-131733-ladsgroup.json [13:17:37] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [13:18:23] TheresNoTime: the write rate for the "parsoid" ParserCache is surprisingly high right now. Not super extreme, but worth keeping an eye on. It's more than I would have expected. 
[13:18:27] Amir1: --^ [13:18:30] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [13:18:31] RECOVERY - PyBal connections to etcd on lvs6001 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:18:38] duesen: ack, still here if you need to revert [13:18:45] 10sre-alert-triage, 10Data-Platform-SRE: 404 from nginx on wcqs2001 - https://phabricator.wikimedia.org/T342762 (10bking) [13:19:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:19:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:20:31] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:22:17] (03PS5) 10Jbond: pcc: update the parse commit method to support "Change-Private:" footer [puppet] - 10https://gerrit.wikimedia.org/r/937534 (https://phabricator.wikimedia.org/T265633) [13:22:21] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:23:18] TheresNoTime: it's fine with me, I'm just waiting for Amir to complain ;) The write rate is stabilizing at 50k/min, which is higher than the 40k/minute we were seeing before the experiment. But not catastrophic. [13:23:31] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:24:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.289 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:24:28] * Lucas_WMDE also around for a while in case TNT has to leave [13:24:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:25:09] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: finish cadvisor rollout on k8s-aux [puppet] - 10https://gerrit.wikimedia.org/r/942420 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [13:25:53] (03PS10) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 [13:25:55] (03CR) 10Jbond: "done" [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 (owner: 10Jbond) [13:27:43] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:28:28] (03CR) 10Btullis: flink-zk: Initiate new flink::zookeeper role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [13:29:49] (03PS7) 10Kamila Součková: add Benthos smoke test to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) [13:30:28] (03PS12) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [13:30:30] (03CR) 10Kamila Součková: "Thank you Alex!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [13:31:17] (03CR) 10Btullis: flink-zk: Initiate new flink::zookeeper role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [13:31:29] (03PS1) 10Filippo Giunchedi: hieradata: complete cadvisor rollout on k8s [puppet] - 10https://gerrit.wikimedia.org/r/942426 (https://phabricator.wikimedia.org/T108027) [13:31:38] (03CR) 10Bking: flink-zk: Initiate new flink::zookeeper role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [13:32:26] !log begin restarting lvs6002 (T335835) [13:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:37] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [13:32:52] (03PS2) 10Jbond: vtrs: drop bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/940379 (https://phabricator.wikimedia.org/T95064) [13:33:04] (03CR) 10Jbond: vtrs: drop bashisms and fix other CI issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940379 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [13:34:02] (03CR) 10Jbond: [C: 03+1] D:apereo_cas::service support FLAT profiles. [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [13:34:34] (03CR) 10Jbond: [C: 03+1] D:apereo_cas::service support FLAT profiles. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [13:34:54] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:35:12] PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [13:35:36] PROBLEM - PyBal backends health check on lvs6002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [13:35:44] PROBLEM - pybal on lvs6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:36:49] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [13:36:59] (03CR) 10Stevemunene: [C: 03+1] flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [13:37:43] (03CR) 10Bking: [C: 03+2] flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [13:39:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [13:41:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T342617)', diff saved to https://phabricator.wikimedia.org/P49754 and previous config saved to /var/cache/conftool/dbconfig/20230727-134141-ladsgroup.json [13:41:47] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:41:56] (03PS1) 10Elukey: ores-legacy: increase resources for the envoy proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/942427 (https://phabricator.wikimedia.org/T341479) [13:43:30] (03PS1) 10Bking: flink-zk: use correct variable for firewall defs [puppet] - 10https://gerrit.wikimedia.org/r/942428 (https://phabricator.wikimedia.org/T341792) [13:45:08] (03CR) 10Btullis: [C: 03+1] flink-zk: use correct variable for firewall defs [puppet] - 10https://gerrit.wikimedia.org/r/942428 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [13:45:38] (03CR) 10Bking: [C: 03+2] flink-zk: use correct variable for firewall defs [puppet] - 10https://gerrit.wikimedia.org/r/942428 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [13:45:57] (03CR) 10Kamila Součková: [C: 03+2] add Benthos smoke test to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [13:46:48] (03Merged) 10jenkins-bot: add Benthos smoke test to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/938256 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [13:49:38] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [13:49:40] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [13:49:55] (03PS1) 10Marostegui: install_server: Add dbstore100[89] to partman [puppet] - 
10https://gerrit.wikimedia.org/r/942431 (https://phabricator.wikimedia.org/T342862) [13:50:27] (03CR) 10Marostegui: [C: 03+2] install_server: Add dbstore100[89] to partman [puppet] - 10https://gerrit.wikimedia.org/r/942431 (https://phabricator.wikimedia.org/T342862) (owner: 10Marostegui) [13:51:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Marostegui) I have assigned the recipe already with the above patch. [13:51:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Marostegui) @BTullis any reason why this needs AAAA records records? The other hosts do not have them and it will likely give some headaches with the mysql gra... [13:52:02] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs6002.drmrs.wmnet [13:53:29] (03PS1) 10Marostegui: dbstore100[89]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/942432 (https://phabricator.wikimedia.org/T342862) [13:55:01] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6002.drmrs.wmnet [13:55:02] PROBLEM - PyBal backends health check on lvs6002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [13:55:12] PROBLEM - pybal on lvs6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:55:24] !log done restarting lvs6002 (T335835) [13:55:26] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:08] RECOVERY - PyBal connections to etcd on lvs6002 is OK: OK: 4 connections 
established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [13:56:13] (03CR) 10Marostegui: [C: 03+2] dbstore100[89]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/942432 (https://phabricator.wikimedia.org/T342862) (owner: 10Marostegui) [13:56:26] RECOVERY - PyBal backends health check on lvs6002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:56:36] RECOVERY - pybal on lvs6002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:56:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P49755 and previous config saved to /var/cache/conftool/dbconfig/20230727-135648-ladsgroup.json [13:58:28] RECOVERY - Zookeeper Server on flink-zk1003 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [14:00:38] (03CR) 10Vgutierrez: Bump target distribution to Bookworm (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/942414 (owner: 10Fabfur) [14:03:27] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ores-legacy: increase resources for the envoy proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/942427 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [14:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:27] (03CR) 10Elukey: [C: 03+2] ores-legacy: increase resources for the envoy proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/942427 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [14:11:03] (ProbeDown) firing: Service 
centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:11:48] (03PS4) 10Slyngshede: D:apereo_cas::service support FLAT profiles. [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) [14:11:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P49756 and previous config saved to /var/cache/conftool/dbconfig/20230727-141154-ladsgroup.json [14:13:30] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:27] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install X - https://phabricator.wikimedia.org/T342892 (10RobH) [14:14:40] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10RobH) [14:15:14] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10RobH) [14:16:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:42] (03PS1) 10Hnowlan: Revert "changeprop: bump node-rdkafka, use buster base (prod version)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/942005 [14:16:45] (03PS1) 10Ilias Sarantopoulos: ml-services: fix eswikiquote [deployment-charts] - 10https://gerrit.wikimedia.org/r/942439 
[14:17:05] (03CR) 10Elukey: [C: 03+1] ml-services: fix eswikiquote [deployment-charts] - 10https://gerrit.wikimedia.org/r/942439 (owner: 10Ilias Sarantopoulos) [14:17:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:41] (03CR) 10Elukey: [C: 03+1] Revert "changeprop: bump node-rdkafka, use buster base (prod version)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/942005 (owner: 10Hnowlan) [14:17:55] (03CR) 10Hnowlan: [C: 03+2] Revert "changeprop: bump node-rdkafka, use buster base (prod version)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/942005 (owner: 10Hnowlan) [14:18:39] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: fix eswikiquote [deployment-charts] - 10https://gerrit.wikimedia.org/r/942439 (owner: 10Ilias Sarantopoulos) [14:18:41] (03CR) 10Slyngshede: D:apereo_cas::service support FLAT profiles. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [14:18:43] (03Merged) 10jenkins-bot: Revert "changeprop: bump node-rdkafka, use buster base (prod version)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/942005 (owner: 10Hnowlan) [14:19:17] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [14:19:23] (03Merged) 10jenkins-bot: ml-services: fix eswikiquote [deployment-charts] - 10https://gerrit.wikimedia.org/r/942439 (owner: 10Ilias Sarantopoulos) [14:19:43] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [14:19:47] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [14:20:11] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:22:38] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [14:22:40] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:23:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:24:08] RECOVERY - Zookeeper Server on flink-zk1001 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [14:24:28] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [14:24:34] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
Add management record for ssw1-a1-codfw - cmooney@cumin1001" [14:25:20] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001" [14:25:20] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:25:38] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:18] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-a1-codfw.mgmt.codfw.wmnet [14:27:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T342617)', diff saved to https://phabricator.wikimedia.org/P49757 and previous config saved to /var/cache/conftool/dbconfig/20230727-142700-ladsgroup.json [14:27:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [14:27:04] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:27:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [14:27:19] (03PS1) 10Kamila Součková: kubernetes: add Benthos cache invalidator service [puppet] - 10https://gerrit.wikimedia.org/r/942440 (https://phabricator.wikimedia.org/T324200) [14:27:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T342617)', diff saved to https://phabricator.wikimedia.org/P49758 and previous config saved to /var/cache/conftool/dbconfig/20230727-142721-ladsgroup.json [14:27:25] (03PS2) 10Ilias Sarantopoulos: ores-extension: enable lw on itwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941955 (https://phabricator.wikimedia.org/T342115) [14:27:46] 
!log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [14:28:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:33:35] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:39:37] (03PS1) 10Kamila Součková: add namespace for benthos-cache-invalidator [deployment-charts] - 10https://gerrit.wikimedia.org/r/942444 (https://phabricator.wikimedia.org/T324200) [14:43:19] (03PS2) 10Fabfur: Bump target distribution to Bookworm [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/942414 (https://phabricator.wikimedia.org/T342154) [14:43:23] (03CR) 10Fabfur: Bump target distribution to Bookworm (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/942414 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:45:03] (03CR) 10CI reject: [V: 04-1] Bump target distribution to Bookworm [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/942414 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:48:22] (03PS3) 10Fabfur: Bump target distribution to Bookworm [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/942414 (https://phabricator.wikimedia.org/T342154) [14:51:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T342617)', diff saved to https://phabricator.wikimedia.org/P49759 and previous config saved to /var/cache/conftool/dbconfig/20230727-145110-ladsgroup.json [14:51:15] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:52:23] (03CR) 10Clément Goubert: [C: 03+1] kubernetes: add 
Benthos cache invalidator service [puppet] - 10https://gerrit.wikimedia.org/r/942440 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [14:52:31] (03CR) 10Clément Goubert: [C: 03+1] add namespace for benthos-cache-invalidator [deployment-charts] - 10https://gerrit.wikimedia.org/r/942444 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [14:56:08] (03PS1) 10Fabfur: fifo-log-demux: Add socat as companion package [puppet] - 10https://gerrit.wikimedia.org/r/942446 [14:58:52] RECOVERY - puppet last run on pybal-test2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:02:57] (03CR) 10Jbond: [C: 03+1] D:apereo_cas::service support FLAT profiles. [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [15:05:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10BTullis) >>! In T342862#9048212, @Marostegui wrote: > @BTullis any reason why this needs AAAA records records? The other hosts do not have them and it... 
[15:05:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10BTullis) [15:06:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P49761 and previous config saved to /var/cache/conftool/dbconfig/20230727-150616-ladsgroup.json [15:08:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: cloud_private_subnet: fail if DNS records cannot be resolved [puppet] - 10https://gerrit.wikimedia.org/r/942394 (owner: 10Arturo Borrero Gonzalez) [15:12:12] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10netops, and 2 others: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10aborrero) [15:13:42] (03CR) 10JHathaway: [C: 03+1] puppetserver: Add a file to track when a service restart or reload is [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:14:39] (03CR) 10JHathaway: [C: 03+1] puppetserver: make notifying configurable [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:15:18] (03CR) 10JHathaway: [C: 03+1] motd: Add motd indicating services which need restarting [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:16:04] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10Unstewarded-production-error, 10Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (10Krinkle) [15:16:07] (03CR) 10Majavah: [C: 04-1] puppetserver: Add a file to track when a service restart or reload is (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:16:27] (03CR) 10JHathaway: [C: 03+1] vtrs: drop 
bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/940379 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [15:16:34] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10netops, and 2 others: Puppet interface:: resources aren't cleaned up - https://phabricator.wikimedia.org/T342899 (10aborrero) [15:17:42] 10SRE, 10Traffic: Recompile fifo-log-demux with hardening options - https://phabricator.wikimedia.org/T342900 (10Fabfur) [15:21:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P49762 and previous config saved to /var/cache/conftool/dbconfig/20230727-152123-ladsgroup.json [15:23:22] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) GitLab test instance has OIDC enabled as well now, thanks @jbond for that! We expect a GitLab security update somewhere ar... [15:26:39] (03CR) 10Kamila Součková: [C: 03+2] kubernetes: add Benthos cache invalidator service [puppet] - 10https://gerrit.wikimedia.org/r/942440 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [15:26:46] (03CR) 10Kamila Součková: [C: 03+2] add namespace for benthos-cache-invalidator [deployment-charts] - 10https://gerrit.wikimedia.org/r/942444 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [15:27:17] (03PS2) 10Cory Massaro: Add timeout values in milliseconds as environment variables. [deployment-charts] - 10https://gerrit.wikimedia.org/r/942017 [15:27:23] (03CR) 10Cory Massaro: Add timeout values in milliseconds as environment variables. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/942017 (owner: 10Cory Massaro) [15:29:04] (03Merged) 10jenkins-bot: add namespace for benthos-cache-invalidator [deployment-charts] - 10https://gerrit.wikimedia.org/r/942444 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [15:30:18] (03CR) 10Hnowlan: [C: 03+2] api-gateway: sample logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/942382 (owner: 10Hnowlan) [15:30:30] (03PS1) 10Jbond: geoip: Add docs and fix lint issues [puppet] - 10https://gerrit.wikimedia.org/r/942452 (https://phabricator.wikimedia.org/T342878) [15:30:32] (03PS1) 10Jbond: geoip: drop GeoIP2-Anonymous-IP from syncing [puppet] - 10https://gerrit.wikimedia.org/r/942453 (https://phabricator.wikimedia.org/T342878) [15:31:12] (03Merged) 10jenkins-bot: api-gateway: sample logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/942382 (owner: 10Hnowlan) [15:31:49] (03PS2) 10Fabfur: fifo-log-demux: Add socat as companion package [puppet] - 10https://gerrit.wikimedia.org/r/942446 [15:32:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Jclark-ctr) Dell they originally requested we take to min configuration to troubleshoot. i advised them that the time between errors is t... 
[15:32:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42720/console" [puppet] - 10https://gerrit.wikimedia.org/r/942453 (https://phabricator.wikimedia.org/T342878) (owner: 10Jbond) [15:32:06] (03CR) 10Btullis: [C: 03+1] geoip: drop GeoIP2-Anonymous-IP from syncing [puppet] - 10https://gerrit.wikimedia.org/r/942453 (https://phabricator.wikimedia.org/T342878) (owner: 10Jbond) [15:32:13] (03CR) 10CI reject: [V: 04-1] fifo-log-demux: Add socat as companion package [puppet] - 10https://gerrit.wikimedia.org/r/942446 (owner: 10Fabfur) [15:32:49] (03PS1) 10Dreamy Jazz: Revert "CheckUser event table migration: Write new on group0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942466 [15:33:29] (03PS3) 10Fabfur: fifo-log-demux: Add socat as companion package [puppet] - 10https://gerrit.wikimedia.org/r/942446 [15:33:34] (03PS2) 10Dreamy Jazz: Revert "CheckUser event table migration: Write new on group0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942466 (https://phabricator.wikimedia.org/T342902) [15:33:39] (03PS3) 10Dreamy Jazz: Revert "CheckUser event table migration: Write new on group0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942466 (https://phabricator.wikimedia.org/T342902) [15:33:53] (03CR) 10jenkins-bot: fifo-log-demux: Add socat as companion package [puppet] - 10https://gerrit.wikimedia.org/r/942446 (owner: 10Fabfur) [15:35:47] Any deployer around to revert a config change for https://phabricator.wikimedia.org/T342902 [15:35:58] (03PS4) 10Fabfur: fifo-log-demux: Add socat as companion package [puppet] - 10https://gerrit.wikimedia.org/r/942446 [15:36:09] jouncebot: nowandnext [15:36:09] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [15:36:09] In 0 hour(s) and 23 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1600) [15:36:13] (03CR) 10Zabe: [C: 03+2] Revert "CheckUser event 
table migration: Write new on group0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942466 (https://phabricator.wikimedia.org/T342902) (owner: 10Dreamy Jazz) [15:36:14] Revert patch is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/942466/ [15:36:28] around too, but looks like zabe was faster [15:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T342617)', diff saved to https://phabricator.wikimedia.org/P49763 and previous config saved to /var/cache/conftool/dbconfig/20230727-153629-ladsgroup.json [15:36:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:36:33] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:36:36] Thanks both [15:36:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:36:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T342617)', diff saved to https://phabricator.wikimedia.org/P49764 and previous config saved to /var/cache/conftool/dbconfig/20230727-153649-ladsgroup.json [15:36:52] Will file a task to see what schema drifts exist regarding this column [15:36:57] (03Merged) 10jenkins-bot: Revert "CheckUser event table migration: Write new on group0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942466 (https://phabricator.wikimedia.org/T342902) (owner: 10Dreamy Jazz) [15:37:03] ;) [15:37:12] (03CR) 10Btullis: [C: 03+1] geoip: Add docs and fix lint issues [puppet] - 10https://gerrit.wikimedia.org/r/942452 (https://phabricator.wikimedia.org/T342878) (owner: 10Jbond) [15:37:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:37:23] !log zabe@deploy1002 Started scap: Backport for [[gerrit:942466|Revert "CheckUser event table migration: Write new on group0" (T342902)]] [15:37:27] T342902: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cuc_only_for_read_old' in 'field list'Function: MediaWiki\CheckUser\Hooks::onAuthManagerLoginAuthenticateAuditQuery: INSERT INTO `cu_changes` (cuc_page_id,cuc_namespac - https://phabricator.wikimedia.org/T342902 [15:38:35] Based on logstash this seems limited to testcommonswiki, but there could be others... [15:38:50] !log zabe@deploy1002 zabe and dreamyjazz: Backport for [[gerrit:942466|Revert "CheckUser event table migration: Write new on group0" (T342902)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [15:39:02] Would you like me to do any testing for this? [15:39:09] nah [15:39:27] we should only check that nothing explodes [15:40:00] !log kamila@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:40:01] !log kamila@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:40:16] !log kamila@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:40:19] !log kamila@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:40:45] (03CR) 10BCornwall: [C: 03+1] conftool-data: Duplicate labweb service as cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/941458 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [15:41:18] Dreamy_Jazz: https://phabricator.wikimedia.org/P49765 [15:41:19] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) fifo-log-demux package is ready to be tested and eventually included in the bookworm repositories: https://gerrit.wikimedia.org/r/c/operations/software/fifo-log-demux/+/942414 [15:41:40] Hmm. Something was missed then in https://phabricator.wikimedia.org/T329203 [15:41:53] At least it wasn't directly my fault :) [15:42:02] it's the little things [15:42:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:42:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [15:42:36] !log kamila@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:44:09] Dreamy_Jazz: this is the exact reason why I wanted to start from group0 before moving to group1 :-P [15:44:09] :) [15:44:17] Yup [15:44:21] Good point. [15:44:28] :) [15:44:29] (not the first time seeing a database that was missed) [15:44:40] !log kamila@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:45:06] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:942466|Revert "CheckUser event table migration: Write new on group0" (T342902)]] (duration: 07m 43s) [15:45:15] T342902: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cuc_only_for_read_old' in 'field list'Function: MediaWiki\CheckUser\Hooks::onAuthManagerLoginAuthenticateAuditQuery: INSERT INTO `cu_changes` (cuc_page_id,cuc_namespac - https://phabricator.wikimedia.org/T342902 [15:45:17] duesen: I'll check PC [15:46:19] (03CR) 10Jbond: [C: 03+2] geoip: Add docs and fix lint issues [puppet] - 10https://gerrit.wikimedia.org/r/942452 (https://phabricator.wikimedia.org/T342878) (owner: 10Jbond) [15:46:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] geoip: drop GeoIP2-Anonymous-IP from syncing [puppet] - 10https://gerrit.wikimedia.org/r/942453 (https://phabricator.wikimedia.org/T342878) (owner: 10Jbond) [15:47:00] kamila_: happy for me to merge your change? [15:48:03] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [15:48:12] jbond: sorry, I'm not sure what you mean? [15:49:43] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [15:49:44] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [15:49:54] !log restart db2097 [15:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:15] kamila_: you have submitted this CR https://gerrit.wikimedia.org/r/c/operations/puppet/+/942440 but you have not merged it (puppet-merge) [15:50:27] I'm currently merging one of mine; are you happy for me to merge yours at the same time? [15:50:28] oh yes! that would explain the failures!
[15:50:32] yes please :D [15:50:41] (03PS2) 10Jforrester: Add wikifunctions.org to certspotter::monitor_domains [puppet] - 10https://gerrit.wikimedia.org/r/941972 (https://phabricator.wikimedia.org/T275945) [15:50:56] :) should be merged now [15:51:03] thank you :D [15:51:07] np [15:52:47] one day I'll stop forgetting... maybe... [15:53:53] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:09] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [15:56:10] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [16:00:05] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1600). [16:00:05] James_F: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:48] 👋 I have an interview running slightly over, I'll be with you in a sec [16:00:56] Thanks. [16:01:01] I understand if my patch scares people and people don't want to deploy it. :-) [16:01:18] (03CR) 10JMeybohm: [C: 03+1] "Very cool, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/937534 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [16:01:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T342617)', diff saved to https://phabricator.wikimedia.org/P49766 and previous config saved to /var/cache/conftool/dbconfig/20230727-160132-ladsgroup.json [16:01:37] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:03:28] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10netops, and 2 others: Puppet interface:: resources aren't cleaned up - https://phabricator.wikimedia.org/T342899 (10aborrero) p:05Triage→03Low [16:04:25] (03CR) 10Jbond: puppetserver: Add a file to track when a service restart or reload is (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:06:59] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) started working on file-read-backwards package [16:08:29] (03CR) 10Jbond: [C: 03+2] puppetserver: make notifying configurable [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:08:32] (03CR) 10Jbond: [C: 03+2] puppetserver: Add a file to track when a service restart or reload is [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:08:40] (03CR) 10Jbond: [C: 03+2] motd: Add motd indicating services which need restarting [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:15:08] (03PS1) 10AikoChou: ml-services: update ores-legacy docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/942455 (https://phabricator.wikimedia.org/T342663) [16:15:45] (03PS1) 10Jbond: motd: correct path [puppet] - 10https://gerrit.wikimedia.org/r/942456 [16:15:54] (03CR) 
10Jbond: [V: 03+2 C: 03+2] motd: correct path [puppet] - 10https://gerrit.wikimedia.org/r/942456 (owner: 10Jbond) [16:16:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P49768 and previous config saved to /var/cache/conftool/dbconfig/20230727-161638-ladsgroup.json [16:18:28] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:18:53] this was me; it should clear shortly, sorry oncallers [16:19:42] cdanis: cwhite: fyi, can be safely ignored [16:20:39] thanks! [16:22:35] (03CR) 10JHathaway: [C: 03+1] "looks good, one question" [puppet] - 10https://gerrit.wikimedia.org/r/942389 (owner: 10Jbond) [16:23:28] (WidespreadPuppetFailure) resolved: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:25:32] James_F: sorry to be so behind! looking now [16:25:39] Thanks! [16:25:47] It's just the certmon one left.
[16:26:48] (03CR) 10RLazarus: [C: 03+2] Add wikifunctions.org to certspotter::monitor_domains [puppet] - 10https://gerrit.wikimedia.org/r/941972 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [16:30:26] jouncebot: nowandnext [16:30:27] For the next 0 hour(s) and 29 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1600) [16:30:27] In 0 hour(s) and 29 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1700) [16:30:27] In 0 hour(s) and 29 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1700) [16:30:52] (03CR) 10Ladsgroup: [C: 03+2] ores-extension: enable lw on itwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941955 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [16:31:03] (03CR) 10Jbond: [C: 03+2] vtrs: drop bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/940379 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [16:31:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941955 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [16:31:28] (03CR) 10Jgiannelos: "This patch only enables the change on staging. 
A question that was brought by the restbase sunset group is whether we should worry on the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/939292 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [16:31:33] (03Merged) 10jenkins-bot: ores-extension: enable lw on itwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941955 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [16:31:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P49769 and previous config saved to /var/cache/conftool/dbconfig/20230727-163144-ladsgroup.json [16:31:48] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:941955|ores-extension: enable lw on itwiki and hewiki (T342115)]] [16:31:52] T342115: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 [16:33:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T342906 (10phaultfinder) [16:33:15] !log ladsgroup@deploy1002 isaranto and ladsgroup: Backport for [[gerrit:941955|ores-extension: enable lw on itwiki and hewiki (T342115)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [16:33:46] James_F: merged and applied [16:33:52] rzl: Thanks! 
[16:34:33] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [16:34:34] (03CR) 10JHathaway: [C: 03+1] puppetserver: Add a file to track when a service restart or reload is (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:34:34] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [16:36:55] (03PS1) 10Bking: flink-zk: allow analytics network [puppet] - 10https://gerrit.wikimedia.org/r/942457 (https://phabricator.wikimedia.org/T341792) [16:38:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:41:15] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [16:41:16] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [16:44:20] !log dancy@deploy1002 Installing scap version "4.57.0" for 600 hosts [16:45:23] !log dancy@deploy1002 Installation of scap version "4.57.0" completed for 600 hosts [16:45:51] 10SRE, 10Abstract Wikipedia team, 10serviceops, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [16:46:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T342617)', diff saved to https://phabricator.wikimedia.org/P49770 and previous config saved to /var/cache/conftool/dbconfig/20230727-164650-ladsgroup.json [16:46:52] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:46:56] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:47:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:47:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T342617)', diff saved to https://phabricator.wikimedia.org/P49771 and previous config saved to /var/cache/conftool/dbconfig/20230727-164711-ladsgroup.json [16:48:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:52:42] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:941955|ores-extension: enable lw on itwiki and hewiki (T342115)]] (duration: 20m 53s) [16:52:46] T342115: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 [16:58:35] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-07-27-112528-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/942459 (https://phabricator.wikimedia.org/T341501) [16:59:51] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-07-27-112528-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/942459 (https://phabricator.wikimedia.org/T341501) (owner: 10BryanDavis) [17:00:00] (03PS1) 10Jcrespo: Revert "mariadb: Disable notifications for db2097, db2141" [puppet] - 10https://gerrit.wikimedia.org/r/942467 [17:00:08] bd808: How many deployers does it take to do Technical Engagement weekly deploy (Toolhub, 
Developer portal, Striker) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1700). [17:00:08] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1700) [17:00:13] (03CR) 10CI reject: [V: 04-1] Revert "mariadb: Disable notifications for db2097, db2141" [puppet] - 10https://gerrit.wikimedia.org/r/942467 (owner: 10Jcrespo) [17:00:36] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-07-27-112528-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/942459 (https://phabricator.wikimedia.org/T341501) (owner: 10BryanDavis) [17:00:47] (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [17:01:26] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:02:24] (03PS2) 10Jcrespo: Revert "mariadb: Disable notifications for db2097, db2141" [puppet] - 10https://gerrit.wikimedia.org/r/942467 [17:02:44] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:02:52] (03CR) 10Jcrespo: [C: 04-1] "Waiting on 10.6 upgrade." 
[puppet] - 10https://gerrit.wikimedia.org/r/942467 (owner: 10Jcrespo) [17:02:52] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:03:33] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:03:46] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:04:27] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:06:10] (03PS1) 10Jcrespo: mariadb: Upgrade db2097 to use mariadb 10.6 package [puppet] - 10https://gerrit.wikimedia.org/r/942461 (https://phabricator.wikimedia.org/T334650) [17:08:00] (03PS3) 10Jcrespo: Revert "mariadb: Disable notifications for db2097, db2141" [puppet] - 10https://gerrit.wikimedia.org/r/942467 [17:08:04] * bd808 is done deploying for today [17:09:04] (03CR) 10Jcrespo: [C: 04-1] "Waiting on 942461 & upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/942467 (owner: 10Jcrespo) [17:09:31] (03CR) 10Jcrespo: [C: 03+1] "Ready to me." [puppet] - 10https://gerrit.wikimedia.org/r/942461 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [17:25:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb1013.eqiad.wmnet with OS bullseye [17:25:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye [17:26:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P49773 and previous config saved to /var/cache/conftool/dbconfig/20230727-172626-ladsgroup.json [17:35:38] (03CR) 10Jforrester: Add timeout values in milliseconds as environment variables. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/942017 (owner: 10Cory Massaro) [17:41:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1013.eqiad.wmnet with reason: host reimage [17:41:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P49774 and previous config saved to /var/cache/conftool/dbconfig/20230727-174132-ladsgroup.json [17:44:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1013.eqiad.wmnet with reason: host reimage [17:46:38] (03PS1) 10Jforrester: tests: Add some PHP testing on logos/config.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942463 [17:47:27] (03CR) 10CI reject: [V: 04-1] tests: Add some PHP testing on logos/config.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942463 (owner: 10Jforrester) [17:48:02] (03PS4) 10Jbond: ssh::known_hosts: add new known_hosts functions [puppet] - 10https://gerrit.wikimedia.org/r/942389 [17:48:45] (03CR) 10Jbond: ssh::known_hosts: add new known_hosts functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942389 (owner: 10Jbond) [17:56:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T342617)', diff saved to https://phabricator.wikimedia.org/P49775 and previous config saved to /var/cache/conftool/dbconfig/20230727-175638-ladsgroup.json [17:56:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [17:56:43] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [17:56:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [17:56:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T342617)', diff saved to 
https://phabricator.wikimedia.org/P49776 and previous config saved to /var/cache/conftool/dbconfig/20230727-175659-ladsgroup.json [17:59:38] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:00:07] jnuche and dancy: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T1800). [18:00:11] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10bking) [18:00:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:00:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb1013.eqiad.wmnet with OS bullseye [18:00:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye completed: - rdb1013 (**PASS**) - Removed from P... 
[18:01:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10Jhancock.wm) [18:12:34] !log krinkle@deploy1002 Started deploy [performance/navtiming@c868e79]: Rename FID labels (Ibab7118be69e50bf), Remove QuickSurveys (T336169), Add Vietnam (T340714) [18:12:40] !log krinkle@deploy1002 Finished deploy [performance/navtiming@c868e79]: Rename FID labels (Ibab7118be69e50bf), Remove QuickSurveys (T336169), Add Vietnam (T340714) (duration: 00m 05s) [18:12:41] T336169: Stop collecting the Performance perception survey - https://phabricator.wikimedia.org/T336169 [18:12:41] T340714: Add Vietnam as one of the tagged countries in our navigation timing metrics from Prometheus - https://phabricator.wikimedia.org/T340714 [18:12:58] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10bking) Per email discussion with @jbond , adding another locking backend for consideration: Swift - PRO: Well-known/supported at WMF - PRO... 
[18:13:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb1014.eqiad.wmnet with OS bullseye [18:14:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye [18:16:04] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10bking) [18:16:48] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10bking) [18:17:13] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10bking) [18:20:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T342617)', diff saved to https://phabricator.wikimedia.org/P49777 and previous config saved to /var/cache/conftool/dbconfig/20230727-182058-ladsgroup.json [18:21:04] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:28:38] (03PS5) 10Jbond: ssh::known_hosts: add new known_hosts functions [puppet] - 10https://gerrit.wikimedia.org/r/942389 [18:28:59] (03CR) 10Milimetric: [C: 03+1] "Merged the related refinery patch, but need to deploy it, so it won't be there until like 30 minutes from now." 
[puppet] - 10https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199) (owner: 10David Martin) [18:33:05] !log milimetric@deploy1002 Started deploy [analytics/refinery@1af57de]: Deploying to sync script updates and static files [18:36:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P49778 and previous config saved to /var/cache/conftool/dbconfig/20230727-183604-ladsgroup.json [18:41:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:41:30] !log milimetric@deploy1002 Finished deploy [analytics/refinery@1af57de]: Deploying to sync script updates and static files (duration: 08m 25s) [18:41:36] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:41:43] !log milimetric@deploy1002 Started deploy [analytics/refinery@1af57de] (thin): Deploying to sync script updates and static files [18:41:47] !log milimetric@deploy1002 Finished deploy [analytics/refinery@1af57de] (thin): Deploying to sync script updates and static files (duration: 00m 04s) [18:44:54] (03PS3) 10Cory Massaro: Add timeout values in milliseconds as environment variables. [deployment-charts] - 10https://gerrit.wikimedia.org/r/942017 [18:45:09] (03CR) 10Cory Massaro: "Thank you! I've now update the evaluator image, too." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/942017 (owner: 10Cory Massaro) [18:50:02] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:50:30] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:51:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P49779 and previous config saved to /var/cache/conftool/dbconfig/20230727-185110-ladsgroup.json [19:04:00] (03PS1) 10Fabfur: Bookworm release. Fix minor lintian warning about missing description. [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/942491 [19:04:46] (03PS2) 10Fabfur: Bookworm release. Fix minor lintian warning about missing description. 
[debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/942491 (https://phabricator.wikimedia.org/T342154) [19:06:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T342617)', diff saved to https://phabricator.wikimedia.org/P49780 and previous config saved to /var/cache/conftool/dbconfig/20230727-190617-ladsgroup.json [19:06:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [19:06:22] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [19:06:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [19:06:37] (03CR) 10Fabfur: Bump target distribution to Bookworm (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/942414 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [19:06:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3316 (T342617)', diff saved to https://phabricator.wikimedia.org/P49781 and previous config saved to /var/cache/conftool/dbconfig/20230727-190637-ladsgroup.json [19:06:52] (03PS1) 10Andrew Bogott: Update Horizon docker image version [puppet] - 10https://gerrit.wikimedia.org/r/942492 [19:07:39] (03CR) 10Andrew Bogott: [C: 03+2] Update Horizon docker image version [puppet] - 10https://gerrit.wikimedia.org/r/942492 (owner: 10Andrew Bogott) [19:08:48] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) file-read-backwards is ready to be tested: https://gerrit.wikimedia.org/r/c/operations/debs/file-read-backwards/+/942491 [19:15:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1014.eqiad.wmnet with reason: host reimage [19:16:40] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with 
command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:17:12] PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1014.eqiad.wmnet with reason: host reimage [19:18:37] PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:21:38] (03PS1) 10Bking: flink-zk: Enable prometheus scrapes [puppet] - 10https://gerrit.wikimedia.org/r/942494 (https://phabricator.wikimedia.org/T341792) [19:23:43] (03PS2) 10Bking: flink-zk: Enable prometheus scrapes [puppet] - 10https://gerrit.wikimedia.org/r/942494 (https://phabricator.wikimedia.org/T341792) [19:24:33] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/942494 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [19:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:27:59] RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:30:41] PROBLEM - Check systemd state on an-worker1091 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:29] PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:33:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:34:31] PROBLEM - Check systemd state on an-worker1129 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:34:41] PROBLEM - Hadoop NodeManager on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:35:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:35:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb1014.eqiad.wmnet with OS bullseye [19:35:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye completed: - rdb1014 (**PASS**) - Removed from P... 
[19:35:19] RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:36:01] RECOVERY - Check systemd state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10Jhancock.wm) [19:36:41] RECOVERY - Check systemd state on an-worker1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:23] RECOVERY - Hadoop NodeManager on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:39:01] PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:39:35] RECOVERY - Check systemd state on an-worker1129 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:39:45] RECOVERY - Hadoop NodeManager on an-worker1129 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:46:44] (03CR) 10Cwhite: [C: 03+2] "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/938326 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [19:48:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T342617)', diff saved to https://phabricator.wikimedia.org/P49782 and previous config saved to /var/cache/conftool/dbconfig/20230727-194856-ladsgroup.json [19:49:01] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [19:51:03] PROBLEM - Check systemd state on an-worker1115 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:56:05] RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:56:55] RECOVERY - Check systemd state on an-worker1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:06] brennen and TheresNoTime: #bothumor I � Unicode. All rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230727T2000). 
[20:00:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10Jhancock.wm) 05Stalled→03Resolved @akosiaris install is complete [20:02:57] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:03:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:04:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P49783 and previous config saved to /var/cache/conftool/dbconfig/20230727-200402-ladsgroup.json [20:07:59] nothing to deploy afaict. stepping afk. [20:18:11] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10RobH) [20:19:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P49784 and previous config saved to /var/cache/conftool/dbconfig/20230727-201908-ladsgroup.json [20:25:09] 10SRE: Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10xcollazo) [20:25:36] 10SRE: Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10xcollazo) @BTullis not sure about the tags for this one. Is it #sre ? [20:28:09] (03CR) 10Cory Massaro: "Okay, looks like this won't cause any explosions ..." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/942017 (owner: 10Cory Massaro) [20:30:06] (03Abandoned) 10Bking: flink-zk: allow analytics network [puppet] - 10https://gerrit.wikimedia.org/r/942457 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [20:34:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T342617)', diff saved to https://phabricator.wikimedia.org/P49785 and previous config saved to /var/cache/conftool/dbconfig/20230727-203415-ladsgroup.json [20:34:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [20:34:20] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [20:34:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [20:34:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1224 (T342617)', diff saved to https://phabricator.wikimedia.org/P49786 and previous config saved to /var/cache/conftool/dbconfig/20230727-203435-ladsgroup.json [20:46:03] (03CR) 10Jforrester: [C: 03+1] Add timeout values in milliseconds as environment variables. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/942017 (owner: 10Cory Massaro) [20:46:29] (03CR) 10Jforrester: [C: 03+1] Create puppet scripting for sqooping Wikifunctions tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199) (owner: 10David Martin) [20:50:09] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T342617)', diff saved to https://phabricator.wikimedia.org/P49787 and previous config saved to /var/cache/conftool/dbconfig/20230727-205744-ladsgroup.json [20:57:49] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [21:03:47] 10SRE-swift-storage, 10Commons: Server error 500 after uploading chunk - https://phabricator.wikimedia.org/T340917 (10Yann) These may not be in the public domain in UK, the source country, however they are in the public domain in USA. 
[21:05:00] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/942494 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [21:05:22] (03CR) 10Cwhite: [C: 03+2] Create puppet scripting for sqooping Wikifunctions tables [puppet] - 10https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199) (owner: 10David Martin) [21:12:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P49788 and previous config saved to /var/cache/conftool/dbconfig/20230727-211250-ladsgroup.json [21:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P49789 and previous config saved to /var/cache/conftool/dbconfig/20230727-212756-ladsgroup.json [21:43:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T342617)', diff saved to https://phabricator.wikimedia.org/P49790 and previous config saved to /var/cache/conftool/dbconfig/20230727-214302-ladsgroup.json [21:43:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [21:43:08] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [21:43:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [23:03:16] thcipriani: jnuche: can I do an emergency backport for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/942512 / T342927 ? CSS-only extension patch, should be very low risk. 
[23:03:19] T342927: [wmf.19-regression] Help panel - text displayed incorrectly - https://phabricator.wikimedia.org/T342927 [23:09:17] jouncebot: nowandnext [23:09:17] No deployments scheduled for the next 6 hour(s) and 50 minute(s) [23:09:17] In 6 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230728T0600) [23:09:38] in my interpretation the nodeploy time hasn't started yet, it's still thursday in the us [23:29:11] the backport window was three hours ago though [23:39:26] tgr: sorry for the delay, sure [23:40:10] tgr: do you need someone to deploy? [23:44:46] thcipriani: no, thanks, can do it [23:45:32] (PS1) Gergő Tisza: help: Fix navigation in the help panel [extensions/GrowthExperiments] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/942468 (https://phabricator.wikimedia.org/T342927) [23:45:49] ok, let me know if you need anything, otherwise approved from me. Seems low risk and looks broken. Thanks for backporting. [23:46:25] (CR) Thcipriani: [C: +1] help: Fix navigation in the help panel [extensions/GrowthExperiments] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/942468 (https://phabricator.wikimedia.org/T342927) (owner: Gergő Tisza) [23:49:37] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/942468 (https://phabricator.wikimedia.org/T342927) (owner: Gergő Tisza)