[00:00:00] (CR) Dzahn: "I don't see an Icinga contact named ssingh, as far as I remember this has to match a contact and an LDAP login, but not sure." [puppet] - https://gerrit.wikimedia.org/r/900509 (owner: Ssingh)
[00:04:55] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:58] (CR) Ssingh: icinga: add ssingh to cgi.cfg (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900509 (owner: Ssingh)
[00:13:21] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on lvs5006.eqsin.wmnet with reason: rebooting for kernel updates
[00:13:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on lvs5006.eqsin.wmnet with reason: rebooting for kernel updates
[00:21:34] SRE, ops-eqiad, decommission-hardware: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 (wiki_willy) a:Jclark-ctr
[00:26:19] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on lvs2010.codfw.wmnet with reason: rebooting for kernel updates
[00:26:34] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on lvs2010.codfw.wmnet with reason: rebooting for kernel updates
[00:30:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:30:44] ^ expected
[00:31:45] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:32:19] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 183, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:33:39] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:35:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on lvs1020.eqiad.wmnet with reason: rebooting for kernel updates
[00:35:41] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on lvs1020.eqiad.wmnet with reason: rebooting for kernel updates
[00:37:29] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:05:22] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs2010.codfw.wmnet
[01:05:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2010.codfw.wmnet
[01:29:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:41] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (39) node(s) change every puppet run: cephosd1001, cephosd1002, cephosd1003, cephosd1004, cephosd1005, clouddumps1001, clouddumps1002, mw2420, mw2421, mw2422, mw2423, mw2424, mw2425, mw2426, mw2427, mw2428, mw2429, mw2430, mw2431, mw2432, mw2433, mw2434, mw2435, mw2436, mw2437, mw2438, mw2439, mw2440, mw2441, mw2442, mw
[01:36:41] 2444, mw2445, mw2446, mw2447, mw2448, mw2449, mw2450, mw2451 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[01:41:31] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (39) node(s) change every puppet run: cephosd1001, cephosd1002, cephosd1003, cephosd1004, cephosd1005, clouddumps1001, clouddumps1002, mw2420, mw2421, mw2422, mw2423, mw2424, mw2425, mw2426, mw2427, mw2428, mw2429, mw2430, mw2431, mw2432, mw2433, mw2434, mw2435, mw2436, mw2437, mw2438, mw2439, mw2440, mw2441, mw2442, mw
[01:41:31] 2444, mw2445, mw2446, mw2447, mw2448, mw2449, mw2450, mw2451 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:53:14] (PS1) Marostegui: instances.yaml: Add db1106 to dbctl [puppet] - https://gerrit.wikimedia.org/r/900529
[05:54:02] (CR) Marostegui: [C: +2] instances.yaml: Add db1106 to dbctl [puppet] - https://gerrit.wikimedia.org/r/900529 (owner: Marostegui)
[05:56:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1106 to dbctl', diff saved to https://phabricator.wikimedia.org/P45887 and previous config saved to /var/cache/conftool/dbconfig/20230317-055643-marostegui.json
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230317T0600)
[06:50:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:55:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230317T0700)
[07:01:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:04:37] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:06:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:08:25] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:26:18] (CR) Nicolas Fraison: spark-operator: enable spark operator mutation webhook (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison)
[07:28:47] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:28:51] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:39:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:39:20] (PS1) Stang: bewiki: Remove group "autoeditor", "reviewer" [mediawiki-config] - https://gerrit.wikimedia.org/r/900537 (https://phabricator.wikimedia.org/T326012)
[07:44:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:44:23] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:44:59] (CR) Dzahn: icinga: add ssingh to cgi.cfg (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900509 (owner: Ssingh)
[07:57:55] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:57:58] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[08:06:02] SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (MoritzMuehlenhoff) @Ottomata @odimitrijevic With the approval sorted out this needs your approval, then.
[08:07:08] SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (MoritzMuehlenhoff) p:Triage→Medium
[08:23:03] SRE-tools, DBA, Infrastructure-Foundations, Spicerack, Datacenter-Switchover: switchdc should verify active/active DBs are read-write in both datacenters - https://phabricator.wikimedia.org/T287129 (Marostegui) @Clement_Goubert @RLazarus I believe this can be closed?
[08:29:41] (PS1) Slyngshede: LDAP attribute editor [software/bitu] - https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463)
[08:43:33] (CR) Jaime Nuche: deployment_server: ensure Docker is installed (2 comments) [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[09:25:21] (PS1) Filippo Giunchedi: traffic: use haproxy for EdgeTrafficDrop [alerts] - https://gerrit.wikimedia.org/r/900626 (https://phabricator.wikimedia.org/T309182)
[09:26:03] (CR) Filippo Giunchedi: "This effectively brings back the alert, I don't know if you are interested or not though! Let me know" [alerts] - https://gerrit.wikimedia.org/r/900626 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:38:02] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:38:05] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:44:12] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/900336 (https://phabricator.wikimedia.org/T326419) (owner: Herron)
[09:44:17] (CR) Filippo Giunchedi: [C: +1] kafka-logging: bring up kafka-logging1004 with node id 1004 [puppet] - https://gerrit.wikimedia.org/r/900337 (https://phabricator.wikimedia.org/T326419) (owner: Herron)
[09:44:46] SRE, Machine-Learning-Team, serviceops-radar, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (elukey) a:calbon→None
[09:45:14] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:45:17] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:46:23] SRE, Machine-Learning-Team, serviceops-radar, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (elukey) This task needs a bit more clarification, we already have an experimental model server for nsfw content. Putting back in "Unsorted" status so...
[09:47:41] (PS1) Filippo Giunchedi: o11y: deploy prometheus alerts to all instances [alerts] - https://gerrit.wikimedia.org/r/900628 (https://phabricator.wikimedia.org/T309182)
[09:52:55] (CR) Filippo Giunchedi: Add Debian packaging for 21.3.0 (1 comment) [software/librenms] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309) (owner: Filippo Giunchedi)
[09:59:40] (CR) David Caro: [C: +2] k8s: update harbor fqdn to the proxy url [software/tools-webservice] - https://gerrit.wikimedia.org/r/900408 (owner: David Caro)
[10:00:25] (Merged) jenkins-bot: k8s: update harbor fqdn to the proxy url [software/tools-webservice] - https://gerrit.wikimedia.org/r/900408 (owner: David Caro)
[10:06:04] (PS3) Phuedx: beta: $wgIPInfoGeoIP2Prefix -> $wgIPInfoGeoLite2Prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/828482
[10:12:45] (PS1) Dzahn: iegreview: add http blackbox monitor [puppet] - https://gerrit.wikimedia.org/r/900631 (https://phabricator.wikimedia.org/T327976)
[10:14:57] (CR) Dzahn: [C: +2] iegreview: add http blackbox monitor [puppet] - https://gerrit.wikimedia.org/r/900631 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn)
[10:17:42] (PS1) Dzahn: iegreview: body_regex_matches needs to be array, not string [puppet] - https://gerrit.wikimedia.org/r/900632 (https://phabricator.wikimedia.org/T327976)
[10:17:52] (CR) CI reject: [V: -1] iegreview: body_regex_matches needs to be array, not string [puppet] - https://gerrit.wikimedia.org/r/900632 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn)
[10:18:03] (PS2) Dzahn: iegreview: body_regex_matches needs to be array, not string [puppet] - https://gerrit.wikimedia.org/r/900632 (https://phabricator.wikimedia.org/T327976)
[10:18:06] (CR) Dzahn: [C: +2] iegreview: body_regex_matches needs to be array, not string [puppet] - https://gerrit.wikimedia.org/r/900632 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn)
[10:24:19] (PS1) Slyngshede: Remove cleanup command [software/bitu] - https://gerrit.wikimedia.org/r/900633
[10:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:30:47] SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (elukey) Getting back to this after a while, since we now need to move to Bullseye. The last blocker is cqlsh running on py2 only, so what if we keep our version of Cassandra but we upgrade its py...
[10:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:37:33] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:07] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: swift-object.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:50:25] PROBLEM - Host ms-be2067 is DOWN: PING CRITICAL - Packet loss = 100%
[10:51:35] RECOVERY - Host ms-be2067 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms
[10:52:23] (CR) Alexandros Kosiaris: [C: +2] Add BGP community to all k8s advertisments [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi)
[10:52:53] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:53:13] (PS1) Volans: setup.py: force dnspython from Bullseye [software/spicerack] - https://gerrit.wikimedia.org/r/900635
[10:53:15] (PS1) Volans: service: improve check_dns_state validation check [software/spicerack] - https://gerrit.wikimedia.org/r/900636
[10:53:17] (PS1) Volans: tox: make config compatible with tox 4.x [software/spicerack] - https://gerrit.wikimedia.org/r/900637
[10:56:50] (CR) CI reject: [V: -1] tox: make config compatible with tox 4.x [software/spicerack] - https://gerrit.wikimedia.org/r/900637 (owner: Volans)
[10:57:16] (Merged) jenkins-bot: Add BGP community to all k8s advertisments [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi)
[11:00:33] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:35] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mon@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:06] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[11:03:11] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (MatthewVernon) Thanks!
[11:03:52] !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[11:05:45] (PS2) Volans: tox: make config compatible with tox 4.x [software/spicerack] - https://gerrit.wikimedia.org/r/900637
[11:07:12] (CR) Dzahn: [C: +2] set target quarter for miscweb bullseye upgrade to 2023-1 [puppet] - https://gerrit.wikimedia.org/r/900463 (https://phabricator.wikimedia.org/T291916) (owner: Dzahn)
[11:14:39] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (akosiaris) ` kosiaris@re0.cr1-codfw> show route receive-protocol bgp 10.192.0.195 detail inet.0: 904595 destinations, 1757164 routes (904...
[11:16:00] !log akosiaris@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:16:25] !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:32:07] SRE, SRE Observability, User-fgiunchedi: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) Opened https://github.com/benthosdev/benthos/issues/1806 to upstream.
[11:32:51] (CR) Clément Goubert: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/900641 (owner: Clément Goubert)
[11:34:13] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40183/console" [puppet] - https://gerrit.wikimedia.org/r/900641 (owner: Clément Goubert)
[11:39:04] (PS1) David Caro: harbor: use external url for the proxies [puppet] - https://gerrit.wikimedia.org/r/900642
[11:39:28] (CR) CI reject: [V: -1] harbor: use external url for the proxies [puppet] - https://gerrit.wikimedia.org/r/900642 (owner: David Caro)
[11:39:58] (PS2) David Caro: harbor: use external url for the proxies [puppet] - https://gerrit.wikimedia.org/r/900642
[11:40:41] SRE-tools, DBA, Infrastructure-Foundations, Spicerack, Datacenter-Switchover: switchdc should verify active/active DBs are read-write in both datacenters - https://phabricator.wikimedia.org/T287129 (Clement_Goubert) Open→Resolved a:Clement_Goubert Since we removed all RW sections...
[11:42:49] (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/899885 (https://phabricator.wikimedia.org/T328872) (owner: Tim Starling)
[11:47:03] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:33] SRE: Training checklist runbook review (Sprint Week 2023-03) - https://phabricator.wikimedia.org/T332391 (LSobanski)
[11:52:41] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mon@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:34] (CR) Clément Goubert: [C: +2] deployment_server: clean up older images using systemd timer [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[11:56:25] (CR) Clément Goubert: [C: +2] docker::gc: update configuration to use latest version of images [puppet] - https://gerrit.wikimedia.org/r/900312 (owner: Jaime Nuche)
[12:02:09] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:09] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:28] that's my fault
[12:02:31] checking
[12:02:41] PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:03:00] claime: thanks for merging that
[12:03:10] Yeah, but I need to fix it now :D
[12:03:12] Mar 17 11:58:16 gitlab-runner2002 docker[4172758]: ValueError: time data '2023-03-17T11:5800Z' does not match format '%Y-%m-%dT%H:%M:%S%z'
[12:03:15] PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:03:17] PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:03:23] PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:03:34] I'll revert that patch for now
[12:03:57] (PS1) Clément Goubert: Revert "docker::gc: update configuration to use latest version of images" [puppet] - https://gerrit.wikimedia.org/r/900561
[12:04:08] claime: +1, let's revert and leave a comment, tx
[12:04:26] (CR) Clément Goubert: [V: +2 C: +2] Revert "docker::gc: update configuration to use latest version of images" [puppet] - https://gerrit.wikimedia.org/r/900561 (owner: Clément Goubert)
[12:04:43] (CR) Dzahn: [C: +1] "ValueError: time data '2023-03-17T11:5800Z' does not match format '%Y-%m-%dT%H:%M:%S%z'" [puppet] - https://gerrit.wikimedia.org/r/900561 (owner: Clément Goubert)
[12:05:41] runs puppet on gitlab-runners
[12:06:31] RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:39] !log systemct-reset failed on gitlab-runner*
[12:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:03] RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:05] RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:13] RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:55] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:55] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:06] SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (MatthewVernon)
[12:09:00] (Abandoned) Ayounsi: Allow AS loops in eqiad staging k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/886328 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi)
[12:09:21] (PS13) Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)
[12:14:13] (CR) CI reject: [V: -1] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[12:24:17] (PS1) Clément Goubert: cpufrequtils: Force reload init script on change [puppet] - https://gerrit.wikimedia.org/r/900645
[12:30:47] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40184/console" [puppet] - https://gerrit.wikimedia.org/r/900645 (owner: Clément Goubert)
[12:31:25] PROBLEM - Check systemd state on cephosd1003 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1003.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:32:21] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:39:38] (PS1) Andrew Bogott: rbd2backy2: check if expiration is empty before parsing [puppet] - https://gerrit.wikimedia.org/r/900647
[12:39:40] (PS1) Andrew Bogott: rbd2backy2: clean up a debugging line [puppet] - https://gerrit.wikimedia.org/r/900648
[12:42:06] (CR) Cathal Mooney: [C: +1] Add cloudsw1-b1-codfw to Rancid [puppet] - https://gerrit.wikimedia.org/r/900448 (https://phabricator.wikimedia.org/T327919) (owner: Ayounsi)
[12:42:23] (CR) Cathal Mooney: [C: +2] Add cloudsw1-b1-codfw to Rancid [puppet] - https://gerrit.wikimedia.org/r/900448 (https://phabricator.wikimedia.org/T327919) (owner: Ayounsi)
[12:42:51] (CR) Filippo Giunchedi: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/900641 (owner: Clément Goubert)
[12:43:12] (CR) Clément Goubert: [V: +1 C: +2] facter/raid.rb: Add MegaRAID Tri-Mode SAS3516 [puppet] - https://gerrit.wikimedia.org/r/900641 (owner: Clément Goubert)
[12:44:54] (CR) Andrew Bogott: [C: +2] rbd2backy2: check if expiration is empty before parsing [puppet] - https://gerrit.wikimedia.org/r/900647 (owner: Andrew Bogott)
[12:59:35] SRE, ops-eqiad: Remove second links from cloud servers - https://phabricator.wikimedia.org/T331737 (Jclark-ctr) All physical cables have been removed. Netbox has not been updated yet
[13:09:21] (PS4) Slyngshede: SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - https://gerrit.wikimedia.org/r/899519
[13:12:06] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:16:18] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:21:10] !log Depooling parse2004.codfw.wmnet for broken PSU - T332119
[13:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:15] T332119: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119
[13:21:15] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=parse2004.codfw.wmnet
[13:24:37] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (Jclark-ctr) cloudvirtlocal101001 D5 U16 PORT 36 CABLEID 5248 cloudvirtlocal101002 E4 U20 PORT 33 CABLEID 20330078 cloudvirtlocal101003 F4 U20...
[13:25:05] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (Jclark-ctr)
[13:34:19] SRE, ops-codfw, DC-Ops, serviceops: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (Clement_Goubert) a:Papaul
[13:34:47] (PS1) Jaime Nuche: docker::gc: update configuration to use latest version of images [puppet] - https://gerrit.wikimedia.org/r/900651
[13:34:56] ops-codfw, DC-Ops, serviceops: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (Clement_Goubert)
[13:36:45] (PS2) Jaime Nuche: docker::gc: update configuration to use latest version of images [puppet] - https://gerrit.wikimedia.org/r/900651
[13:37:43] (PS3) Jaime Nuche: docker::gc: update configuration to use latest version of images [puppet] - https://gerrit.wikimedia.org/r/900651
[13:44:38] (CR) Clément Goubert: [C: +2] docker::gc: update configuration to use latest version of images [puppet] - https://gerrit.wikimedia.org/r/900651 (owner: Jaime Nuche)
[13:48:18] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:51:21] !log bking@cumin1001 START - Cookbook sre.wdqs.restart
[13:51:21] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[13:51:27] !log bking@cumin1001 START - Cookbook sre.wdqs.restart
[13:51:27] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[13:51:52] !log bking@cumin1001 START - Cookbook sre.wdqs.restart
[13:53:18] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:55:41] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[13:57:04] !log bking@cumin1001 START - Cookbook sre.wdqs.restart
[13:57:04] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[13:57:28] (PS1) Jforrester: rdbms: Add missing QUERY_CHANGE_ flag to internal "USE" query [core] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900667 (https://phabricator.wikimedia.org/T332228)
[13:57:48] !log bking@cumin1001 START - Cookbook sre.wdqs.restart
[13:58:52] SRE, Continuous-Integration-Infrastructure, Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (hashar) Gerrit is going to remove the robot comments entirely: https://groups.google.com/g/repo-discuss...
[13:59:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye
[13:59:43] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye
[13:59:44] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe1013.eqiad.wmnet with OS bullseye
[13:59:48] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye executed with...
[14:05:11] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [14:06:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [14:09:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Jclark-ctr) a:03Cmjohnson [14:13:38] !log bking@cumin1001 START - Cookbook sre.wdqs.restart [14:16:00] (03Abandoned) 10Jforrester: deployment-prep: remove php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/836945 (owner: 10Jforrester) [14:23:45] PROBLEM - mediawiki-installation DSH group on parse2004 is CRITICAL: Host parse2004 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:27:52] 10SRE, 10ops-eqiad: Remove second links from cloud servers - https://phabricator.wikimedia.org/T331737 (10Jclark-ctr) 05Open→03Resolved Removed second link in Netbox [14:35:00] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [14:41:59] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:43] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:30] !log bking@cumin1001 START - Cookbook sre.wdqs.restart [14:54:30] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [14:55:00] !log bking@cumin1001 START - Cookbook sre.wdqs.restart [14:55:00] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [14:55:21] !log bking@cumin1001 START - Cookbook sre.wdqs.restart [15:10:30] (03PS1) 10Hashar: contint: allow CORS header for Zuul change status [puppet] - 
10https://gerrit.wikimedia.org/r/900663 (https://phabricator.wikimedia.org/T214068) [15:12:45] (03CR) 10Hashar: "A similar change got made in integration/docroot which I use to simulate requests to https://integration.wikimedia.org/zuul/status/change/" [puppet] - 10https://gerrit.wikimedia.org/r/900663 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [15:24:01] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [15:28:08] 10ops-eqiad, 10cloud-services-team (Hardware): cloudcephosd1025: power supply temperature critical - https://phabricator.wikimedia.org/T332406 (10aborrero) [15:29:42] !log bking@cumin1001 START - Cookbook sre.wdqs.restart [15:29:47] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudcephosd1025: power supply temperature critical - https://phabricator.wikimedia.org/T332406 (10aborrero) [15:30:26] 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (10Papaul) a:05Papaul→03Jhancock.wm [15:35:48] (03PS1) 10Jaime Nuche: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900687 [15:37:25] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) [15:37:42] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) [15:38:03] sukhe, urandom, nothing to report oncall-wise! Have a good weekend [15:39:16] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10Eevans) >>! In T310980#8704793, @elukey wrote: > Getting back to this after a while, since we now need to move to Bullseye. The last blocker is cqlsh running on py2 only, so what if we keep our v... 
[15:41:28] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) okok this is the part that I wasn't unclear about - we'd just deploy cqlsh in another way, like via puppet, and leverage the /usr/local precedence right? If so this could be something to... [15:43:17] (03PS1) 10Stang: kuwiktionary: Add wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900689 (https://phabricator.wikimedia.org/T326067) [15:44:05] (03Abandoned) 10Jaime Nuche: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900687 (owner: 10Jaime Nuche) [15:46:03] RECOVERY - IPMI Sensor Status on parse2004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:47:43] 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (10Jhancock.wm) a:05Jhancock.wm→03Clement_Goubert the physical PSU is showing as up and the server does not have an amber warning light on. replaced PSU from decommed server. alert... [15:49:21] thanks Xi! [15:49:35] er Xionix [15:49:51] autocomplete fail next level [15:50:52] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [15:51:39] (03PS1) 10Stang: slwiki: Create Draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900690 (https://phabricator.wikimedia.org/T332351) [16:03:02] (03PS1) 10Jaime Nuche: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/900692 [16:05:30] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10Eevans) >>! 
In T310980#8705537, @elukey wrote: > okok this is the part that I wasn't unclear about - we'd just deploy cqlsh in another way, like via puppet, and leverage the /usr/local precedence... [16:17:23] (03PS2) 10Jaime Nuche: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/900692 [16:20:25] (03PS1) 10Btullis: Update the location of the mgr and mon keyrings for the new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/900694 (https://phabricator.wikimedia.org/T330149) [16:20:56] (03CR) 10Jaime Nuche: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/900692 (owner: 10Jaime Nuche) [16:21:45] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/900692 (owner: 10Jaime Nuche) [16:25:20] (03PS2) 10Btullis: Update the location of the mgr and mon keyrings for the new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/900694 (https://phabricator.wikimedia.org/T330149) [16:26:17] (03CR) 10Btullis: [C: 03+2] Update the location of the mgr and mon keyrings for the new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/900694 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis) [16:28:33] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:40] 10SRE-swift-storage, 10Thumbor, 10Platform Team Workboards (Platform Engineering Reliability): Thumbor 404s on an auth failure to Swift - https://phabricator.wikimedia.org/T332210 (10MatthewVernon) [16:33:08] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon I'm closing this task now, as I think thumbnails are now being 
correctly generated. [16:33:17] (03PS1) 10Zoranzoki21: [WIP] Remove FlaggedRevs from ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) [16:34:15] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:19] 10SRE-swift-storage, 10Commons, 10Patch-For-Review, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) I think this issue is resolved now? Swift error rates have remained... [16:34:55] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:21] (03CR) 10Zoranzoki21: "This is marked as a WIP because I'm sure that this needs more work." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21) [16:39:51] PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1004.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:41] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:48] (03PS1) 10Btullis: Correct an error in the location of the ceph mon and mgr keyrings [puppet] - 10https://gerrit.wikimedia.org/r/900697 (https://phabricator.wikimedia.org/T330149) [16:43:15] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40185/console" [puppet] - 10https://gerrit.wikimedia.org/r/900697 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis) [16:44:14] (03CR) 10Btullis: [V: 03+1 C: 03+2] Correct an error in the location of the ceph mon and mgr keyrings [puppet] - 10https://gerrit.wikimedia.org/r/900697 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis) [16:51:41] Hi, I would like to know what's the status of several tasks related to creating wiki replica, like T327841 [16:51:41] T327841: Prepare and check storage layer for gurwiki - https://phabricator.wikimedia.org/T327841 [17:01:24] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10doctaxon) 05Resolved→03Open I'm reopening this task now, as I think thumbnails are still missing: * https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/%C4%8Cern%C3%A1_Hora-pah%C... 
[17:02:51] (03PS1) 10Btullis: Remove the specified location of the mon keys [puppet] - 10https://gerrit.wikimedia.org/r/900699 (https://phabricator.wikimedia.org/T330149) [17:03:18] (03CR) 10BCornwall: "Hm, not sure about this one: Added Sukhbir, which removed HAProxyEdgeTrafficDrop entirely (I22bdd22fc67f34624d8e000299f85c23ad60248c / htt" [alerts] - 10https://gerrit.wikimedia.org/r/900626 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [17:04:36] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40186/console" [puppet] - 10https://gerrit.wikimedia.org/r/900699 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis) [17:05:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs5004.eqsin.wmnet with reason: rebooting for kernel updates [17:05:15] (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove the specified location of the mon keys [puppet] - 10https://gerrit.wikimedia.org/r/900699 (https://phabricator.wikimedia.org/T330149) (owner: 10Btullis) [17:05:16] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs5004.eqsin.wmnet with reason: rebooting for kernel updates [17:05:54] please ignore BGP alerts in eqsin (I am on-call and keeping an eye out) [17:06:55] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:01] (03CR) 10Ssingh: traffic: use haproxy for EdgeTrafficDrop (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/900626 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [17:11:47] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 10.27 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [17:12:17] RECOVERY - Check systemd state on cephosd1005 is OK: OK -
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:37] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:15] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 876054 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [17:13:43] RECOVERY - Persistent high iowait on labstore1004 is OK: (C)10 ge (W)5 ge 2.902 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [17:18:01] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:44] (03PS1) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [17:22:09] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper) 05Open→03Resolved That's not a 404 error but a 429 error, thus a different issue. 
[17:25:50] (03CR) 10MVernon: "Sorry, one further thing - some of these panels include the thanos-frontends, which isn't what we want here (while Thanos does run swift, " [puppet] - 10https://gerrit.wikimedia.org/r/899885 (https://phabricator.wikimedia.org/T328872) (owner: 10Tim Starling) [17:27:17] (03PS1) 10Cwhite: mediawiki: provision statsd_exporter on canary_appserver [puppet] - 10https://gerrit.wikimedia.org/r/900706 (https://phabricator.wikimedia.org/T240685) [17:29:11] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs4008.ulsfo.wmnet with reason: rebooting for kernel updates [17:29:27] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs4008.ulsfo.wmnet with reason: rebooting for kernel updates [17:29:48] please ignore BGP alerts basically everywhere :P (I am on-call and keeping an eye out) [17:30:03] (doing a mix of LVS reboots and hence) [17:30:48] (03CR) 10Ssingh: icinga: add ssingh to cgi.cfg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900509 (owner: 10Ssingh) [17:31:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs5004.eqsin.wmnet [17:31:52] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs5004.eqsin.wmnet [17:33:47] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:34:17] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:35:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs6001.drmrs.wmnet with reason: rebooting for kernel updates [17:35:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs6001.drmrs.wmnet with reason: rebooting for kernel updates [17:40:05] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is
CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:43:41] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:51] (03CR) 10Volans: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [17:45:14] (03CR) 10CI reject: [V: 04-1] Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [17:49:11] (03PS1) 10Giuseppe Lavagetto: trafficserver: make routing to mw on k8s more manageable [puppet] - 10https://gerrit.wikimedia.org/r/900704 (https://phabricator.wikimedia.org/T331318) [17:49:21] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:47] (03PS2) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [17:51:05] (03CR) 10CI reject: [V: 04-1] trafficserver: make routing to mw on k8s more manageable [puppet] - 10https://gerrit.wikimedia.org/r/900704 (https://phabricator.wikimedia.org/T331318) (owner: 10Giuseppe Lavagetto) [17:54:55] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:55:23] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:59:13] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:04:31] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs2007.codfw.wmnet with reason: rebooting for kernel updates [18:04:46] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs2007.codfw.wmnet with reason: rebooting for kernel updates [18:08:45] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:09:55] !log fab@deploy2002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [18:10:11] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:10:14] !log fab@deploy2002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 19s) [18:12:25] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10doctaxon) @Aklapper which bug task belongs to this 429 error issue? 
[18:27:11] (03CR) 10BCornwall: traffic: use haproxy for EdgeTrafficDrop (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/900626 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [18:29:31] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:30:01] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 183, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:31:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:40:00 on lvs1017.eqiad.wmnet with reason: rebooting for kernel updates [18:32:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lvs1017.eqiad.wmnet with reason: rebooting for kernel updates [18:34:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs5005.eqsin.wmnet with reason: rebooting for kernel updates [18:34:44] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs5005.eqsin.wmnet with reason: rebooting for kernel updates [18:35:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs6002.drmrs.wmnet with reason: rebooting for kernel updates [18:35:41] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs6002.drmrs.wmnet with reason: rebooting for kernel updates [18:36:18] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10doctaxon) I get this 429 error with <50 thumbnails, nearly every day. Is this a part of... 
[18:36:25] (03CR) 10Jameel Kaisar: "test" [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [18:37:11] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [18:37:13] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10Aklapper) @doctaxon: [Not sure](https://phabricator.wikimedia.org/maniphest/query/MRs9glgkf2A0/); [please search](https://www.mediawiki.org/wiki/Phabricator/Help#Searching_for_items). [18:37:15] 10SRE, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10BCornwall) 05Open→03Stalled [18:39:11] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:39:43] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:39:47] ^ expected [18:43:15] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:49] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:15] (03PS1) 10Ahmon Dancy: profile::mediawiki::deployment::server: Don't pass HELM_* vars to train presync [puppet] - 10https://gerrit.wikimedia.org/r/900731 (https://phabricator.wikimedia.org/T331479) [18:48:55] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[18:49:55] (03CR) 10Albertoleoncio: "+comments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21) [18:53:15] (03CR) 10Dzahn: "your Icinga contact in the private repo is "Sukhbir Singh" with also "Sukhbir Singh" as alias. try renaming it to "ssingh", with alias "Su" [puppet] - 10https://gerrit.wikimedia.org/r/900509 (owner: 10Ssingh) [18:54:00] (03CR) 10Dzahn: [C: 03+1] "you should merge this, but after you renamed the contact in the private repo" [puppet] - 10https://gerrit.wikimedia.org/r/900509 (owner: 10Ssingh) [19:03:47] 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (10Papaul) 05Open→03Resolved Icinga checks for the PSUs are all green. We can resolve the task. [19:05:03] PROBLEM - pybal on lvs5005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:05:47] PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [19:05:51] PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:06:42] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@7d75578]: enable templating of ores threshold fetch [19:06:43] PROBLEM - PyBal backends health check on lvs6002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:06:55] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@7d75578]: enable templating of ores threshold fetch (duration: 00m 13s) [19:06:59] PROBLEM - pybal on lvs6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args
/usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:09:01] PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [19:12:53] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:13:37] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:14:25] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [19:15:23] ^ all ok [19:15:31] resolving soon, the downtime expired :) [19:15:33] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:15:42] sukhe: ! 
thanks:) [19:16:09] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:16:23] RECOVERY - PyBal backends health check on lvs6002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:16:37] RECOVERY - pybal on lvs6002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:16:41] RECOVERY - pybal on lvs5005 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:16:43] RECOVERY - pybal on lvs1017 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:16:54] here we go [19:17:29] RECOVERY - PyBal backends health check on lvs5005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:17:37] 👍 [19:17:37] RECOVERY - PyBal connections to etcd on lvs6002 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [19:18:45] RECOVERY - Check systemd state on cephosd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:19] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [19:20:51] RECOVERY - PyBal connections to etcd on lvs5005 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [19:24:31] PROBLEM - Check systemd state on cephosd1003 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1003.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:58] (03CR) 10Herron: [C: 03+1] o11y: deploy prometheus alerts to all instances [alerts] - 
10https://gerrit.wikimedia.org/r/900628 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [19:35:51] (03CR) 10Herron: [V: 03+2 C: 03+2] add tox json(net) linting and address issues raised [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/896408 (https://phabricator.wikimedia.org/T331659) (owner: 10Herron) [19:52:19] !log Testing Mastodon account changes. This should post to @wikimedia_sal@botsin.space [19:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:18] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@4aeffc6]: improve handling of ores threshold fetching [19:53:31] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@4aeffc6]: improve handling of ores threshold fetching (duration: 00m 13s) [19:59:42] (03PS1) 10Dzahn: site: add miscweb1003 to miscweb role [puppet] - 10https://gerrit.wikimedia.org/r/900739 (https://phabricator.wikimedia.org/T331896) [20:01:01] (03PS1) 10Ebernhardson: airflow-search: Update sudo rules for instanced services [puppet] - 10https://gerrit.wikimedia.org/r/900740 [20:01:49] (03CR) 10Ssingh: [C: 03+2] icinga: add ssingh to cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/900509 (owner: 10Ssingh) [20:03:36] (03CR) 10Bking: [C: 03+2] airflow-search: Update sudo rules for instanced services [puppet] - 10https://gerrit.wikimedia.org/r/900740 (owner: 10Ebernhardson) [20:04:53] mutante: thank you! 
I was definitely missing what you suggested for the Icinga permissions [20:04:56] <3 [20:07:49] sukhe: :) hope it works [20:07:56] (03PS1) 10Dzahn: peopleweb: add monitor for people.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/900741 (https://phabricator.wikimedia.org/T329587) [20:08:58] sukhe: run puppet on alert1001 and "icinga -v /etc/icinga/icinga.cfg" to double check it likes the config [20:09:17] then try again when logged in as "ssingh" [20:10:32] (03PS1) 10Stang: trwikivoyage: Update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900742 (https://phabricator.wikimedia.org/T332439) [20:15:06] mutante: it did work! [20:15:27] this was the missing bit, not sure how it got skipped but yeah [20:15:37] sukhe: :) [20:20:05] (03PS1) 10Dzahn: wdqs: add monitor for query. and query-preview.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/900743 (https://phabricator.wikimedia.org/T329587) [20:39:19] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:03] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:19] 10SRE-swift-storage, 10Commons, 10Patch-For-Review, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10TheDJ) @MatthewVernon I think so. Although I guess it might be worth evaluating i...
[21:13:21] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:18] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10TheDJ) @doctaxon There are basically 2 main causes for 429 errors, but both have the sam... [21:19:05] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:21] (03PS6) 10Aaron Schulz: DNM: add per-action component-level profiling in statsd using excimer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) [21:30:32] (03CR) 10Aaron Schulz: DNM: add per-action component-level profiling in statsd using excimer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [21:38:49] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:37] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:11] (03CR) 10Krinkle: DNM: add per-action component-level profiling in statsd using excimer (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [21:48:55] RECOVERY - Check systemd state on cephosd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:54:41] PROBLEM 
- Check systemd state on cephosd1003 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1003.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:07:05] RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:09:31] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:51] PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1004.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:15:19] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:43:37] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:49:25] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:07] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:12:55] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:14:53] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:43] PROBLEM - Check systemd state on cephosd1005 is 
CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:19] RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:43] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:43:03] PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1004.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:27] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state