[00:05:56] (CR) Jbond: profile::mirrors: move mirrors module into profiles (3 comments) [puppet] - https://gerrit.wikimedia.org/r/767889 (https://phabricator.wikimedia.org/T300985) (owner: JHathaway)
[00:10:52] (CR) Jbond: "from a quick look seems good lets check with pcc" [puppet] - https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) (owner: JHathaway)
[00:12:50] (PS2) Cwhite: profile: update graphite mediawiki grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763819 (https://phabricator.wikimedia.org/T211982)
[00:12:52] (PS2) Cwhite: maps: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763822 (https://phabricator.wikimedia.org/T211982)
[00:12:54] (PS2) Cwhite: search: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763823 (https://phabricator.wikimedia.org/T211982)
[00:12:56] (PS2) Cwhite: zuul: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763824 (https://phabricator.wikimedia.org/T211982)
[00:12:58] (PS2) Cwhite: zookeeper: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763825 (https://phabricator.wikimedia.org/T211982)
[00:13:00] (PS2) Cwhite: kafka: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763826 (https://phabricator.wikimedia.org/T211982)
[00:13:02] (PS2) Cwhite: eventlogging: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763827 (https://phabricator.wikimedia.org/T211982)
[00:13:04] (PS2) Cwhite: caches: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763830 (https://phabricator.wikimedia.org/T211982)
[00:13:06] (PS2) Cwhite: hadoop: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763831 (https://phabricator.wikimedia.org/T211982)
[00:13:08] (PS2) Cwhite: graphite: update grafana dashboards links [puppet] -
https://gerrit.wikimedia.org/r/763832 (https://phabricator.wikimedia.org/T211982)
[00:13:53] (CR) jerkins-bot: [V: -1] search: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763823 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:14:01] (PS3) Cwhite: graphite: update grafana dashboards links [puppet] - https://gerrit.wikimedia.org/r/763832 (https://phabricator.wikimedia.org/T211982)
[00:14:37] (CR) jerkins-bot: [V: -1] zuul: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763824 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:14:50] (PS3) Cwhite: zuul: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763824 (https://phabricator.wikimedia.org/T211982)
[00:15:09] (CR) jerkins-bot: [V: -1] zookeeper: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763825 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:15:26] (PS3) Cwhite: zookeeper: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763825 (https://phabricator.wikimedia.org/T211982)
[00:15:51] (CR) jerkins-bot: [V: -1] kafka: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763826 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:16:13] (PS3) Cwhite: kafka: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763826 (https://phabricator.wikimedia.org/T211982)
[00:16:43] (CR) jerkins-bot: [V: -1] eventlogging: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763827 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:17:36] (PS3) Cwhite: eventlogging: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763827 (https://phabricator.wikimedia.org/T211982)
[00:17:39] (CR) jerkins-bot: [V: -1] caches: update grafana dashboard links [puppet] -
https://gerrit.wikimedia.org/r/763830 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:17:42] (PS4) Cwhite: eventlogging: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763827 (https://phabricator.wikimedia.org/T211982)
[00:18:27] (PS3) Cwhite: caches: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763830 (https://phabricator.wikimedia.org/T211982)
[00:18:41] (CR) jerkins-bot: [V: -1] hadoop: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763831 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:19:10] (PS3) Cwhite: hadoop: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763831 (https://phabricator.wikimedia.org/T211982)
[00:19:58] (CR) jerkins-bot: [V: -1] graphite: update grafana dashboards links [puppet] - https://gerrit.wikimedia.org/r/763832 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:22:54] (PS3) Cwhite: search: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763823 (https://phabricator.wikimedia.org/T211982)
[00:24:45] (CR) Cwhite: [C: +2] graphite: update grafana dashboards links [puppet] - https://gerrit.wikimedia.org/r/763832 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:24:46] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:26:05] (CR) Cwhite: [C: +2] kafka: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763826 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:28:56] (CR) Cwhite: "Has this been tested with Grafana 8?" [software] - https://gerrit.wikimedia.org/r/767118 (owner: Filippo Giunchedi)
[00:32:25] (CR) Cwhite: [C: +1] "The logstash file here is just a test fixture."
[puppet] - https://gerrit.wikimedia.org/r/767788 (https://phabricator.wikimedia.org/T273915) (owner: Alexandros Kosiaris)
[00:33:52] (CR) Cwhite: [C: +2] aptrepo: update grafana version to <8.4 [puppet] - https://gerrit.wikimedia.org/r/767608 (https://phabricator.wikimedia.org/T282863) (owner: Cwhite)
[00:40:44] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 56.41 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:43:36] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 82.28 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:22:55] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply
[01:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:43] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply
[01:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:44] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply
[01:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:16] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply
[01:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:17] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[01:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:58] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[01:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:59] !log rzl@deploy1002 helmfile [eqiad] START
helmfile.d/services/eventgate-analytics-external: apply
[01:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:56] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[01:25:57] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[01:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:03] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[01:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:05] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[01:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:31] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[01:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:32] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[01:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:30] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[01:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:31] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[01:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:19] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[01:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:20] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[01:31:21] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:50] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[01:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:51] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply
[01:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:56] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply
[01:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:58] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[01:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:42] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[01:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:43] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply
[01:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:55] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[01:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:56] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[01:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:44] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[01:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:55] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:40:13] SRE, Traffic, envoy, serviceops,
Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[01:42:55] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:43:07] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:46:35] hey urbanecm, I couldn't poach T302973 off of you could I....? ^^
[01:46:35] T302973: Temporary lift IP cap for WikiGap edit-a-thon at Khawarizmi College in 7 March 2022 - https://phabricator.wikimedia.org/T302973
[01:46:53] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus) 1.15.4 is still running in a few places on k8s -- after bumping the default version, I rolled out all services where that was the only diff. Some servic...
[01:47:55] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:49:05] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[02:17:05] (PS1) Samtar: Raise $wgAutoblockExpiry from 1 day to 3 days [mediawiki-config] - https://gerrit.wikimedia.org/r/767912 (https://phabricator.wikimedia.org/T43479)
[02:17:52] (CR) jerkins-bot: [V: -1] Raise $wgAutoblockExpiry from 1 day to 3 days [mediawiki-config] - https://gerrit.wikimedia.org/r/767912 (https://phabricator.wikimedia.org/T43479) (owner: Samtar)
[02:19:50] (PS2) Samtar: Raise $wgAutoblockExpiry from 1 day to 3 days [mediawiki-config] - https://gerrit.wikimedia.org/r/767912 (https://phabricator.wikimedia.org/T43479)
[02:32:03] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:11:05] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:57] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:23:45] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms
[03:28:21] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:33:45] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:41:45] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:48:35] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[03:51:37] PROBLEM - Persistent high iowait on
labstore1006 is CRITICAL: 15.88 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[04:05:33] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 10.16 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[04:07:09] PROBLEM - SSH on analytics1067.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:10:53] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 11.69 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[04:13:33] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 3.831 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[04:29:59] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:42:55] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:45:57] (PS1) Ladsgroup: Revert "db1148: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/767814
[05:46:18] (PS2) Ladsgroup: Revert "db1148: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/767814
[05:46:52] (CR) Ladsgroup: [V: +2 C: +2] Revert "db1148: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/767814 (owner: Ladsgroup)
[06:20:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance
elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org
[07:12:07] RECOVERY - SSH on analytics1067.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:17:10] (PS1) Elukey: istio: add the install-cni docker file [docker-images/production-images] - https://gerrit.wikimedia.org/r/767924
[07:19:42] (PS1) Majavah: Remove hiera files for nonexistent Cloud VPS instances [puppet] - https://gerrit.wikimedia.org/r/767925
[07:20:13] (CR) jerkins-bot: [V: -1] Remove hiera files for nonexistent Cloud VPS instances [puppet] - https://gerrit.wikimedia.org/r/767925 (owner: Majavah)
[07:20:42] (PS2) Majavah: Remove hiera files for nonexistent Cloud VPS instances [puppet] - https://gerrit.wikimedia.org/r/767925
[07:23:00] (PS2) Elukey: istio: add the install-cni docker file [docker-images/production-images] - https://gerrit.wikimedia.org/r/767924 (https://phabricator.wikimedia.org/T297612)
[07:27:06] !log push pfw policies - T303003
[07:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:47] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:13] (PS2) Alexandros Kosiaris: rdb1011: Switch to master [puppet] - https://gerrit.wikimedia.org/r/767733 (https://phabricator.wikimedia.org/T281217)
[07:55:09] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:57:28] (CR) Alexandros Kosiaris: [C: +2] rdb1011: Switch to master [puppet] - https://gerrit.wikimedia.org/r/767733
(https://phabricator.wikimedia.org/T281217) (owner: Alexandros Kosiaris)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220304T0800)
[08:00:34] SRE, Product-Infrastructure-Team-Backlog, serviceops, Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (MSantos) >>! In T274388#7751113, @akosiaris wrote: >>>! In T274388#7744335, @MSantos wrote: >>> Set up the traffic layer to send traffic...
[08:04:56] (PS2) Alexandros Kosiaris: rdb1005: Remove from site.pp [puppet] - https://gerrit.wikimedia.org/r/767734 (https://phabricator.wikimedia.org/T281217)
[08:05:29] (CR) Arturo Borrero Gonzalez: [C: +2] alertmanager: add basic wmcs routing rules [puppet] - https://gerrit.wikimedia.org/r/767790 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[08:11:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[08:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[08:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[08:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[08:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet
with reason: Maintenance
[08:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T300992)', diff saved to https://phabricator.wikimedia.org/P21805 and previous config saved to /var/cache/conftool/dbconfig/20220304-081210-ladsgroup.json
[08:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:13] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[08:14:16] (PS3) Alexandros Kosiaris: rdb1005: Remove from site.pp [puppet] - https://gerrit.wikimedia.org/r/767734 (https://phabricator.wikimedia.org/T281217)
[08:14:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T300992)', diff saved to https://phabricator.wikimedia.org/P21806 and previous config saved to /var/cache/conftool/dbconfig/20220304-081417-ladsgroup.json
[08:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:59] (CR) Alexandros Kosiaris: [C: +2] rdb1005: Remove from site.pp [puppet] - https://gerrit.wikimedia.org/r/767734 (https://phabricator.wikimedia.org/T281217) (owner: Alexandros Kosiaris)
[08:17:07] (PS1) Hashar: gerrit: prevent 'null' entry in email [puppet] - https://gerrit.wikimedia.org/r/768005 (https://phabricator.wikimedia.org/T288312)
[08:17:17] SRE, LDAP-Access-Requests: Grant Access to for OTichonova - https://phabricator.wikimedia.org/T302986 (JMeybohm) Open→Resolved a:JMeybohm
[08:17:52] (CR) Hashar: "https://phabricator.wikimedia.org/T288312#7752288 has the detailed rationale :)" [puppet] -
https://gerrit.wikimedia.org/r/768005 (https://phabricator.wikimedia.org/T288312) (owner: Hashar)
[08:19:13] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission for hosts rdb[1005-1006].eqiad.wmnet
[08:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:40] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox
[08:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:42] SRE, Product-Infrastructure-Team-Backlog, serviceops, Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (akosiaris) >>! In T274388#7752324, @MSantos wrote: >>>! In T274388#7751113, @akosiaris wrote: >>>>! In T274388#7744335, @MSantos wrote:...
[08:28:41] PROBLEM - Check systemd state on db2073 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P21807 and previous config saved to /var/cache/conftool/dbconfig/20220304-082922-ladsgroup.json
[08:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:06] (CR) Muehlenhoff: O:idp_test: update same site policy and disale pin to session (1 comment) [puppet] - https://gerrit.wikimedia.org/r/767779 (owner: Jbond)
[08:32:50] (PS2) Alexandros Kosiaris: scap: Switch mw1306 to mw1318 for scap proxy role [puppet] - https://gerrit.wikimedia.org/r/767787 (https://phabricator.wikimedia.org/T273915)
[08:32:52] (PS2) Alexandros Kosiaris: mw130[2-6]: Remove and decomission [puppet] - https://gerrit.wikimedia.org/r/767788 (https://phabricator.wikimedia.org/T273915)
[08:32:54] (CR) Muehlenhoff: O:idp: update same site policy and disale pin to session (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/767780 (owner: Jbond)
[08:33:27] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=mw130[2-6].eqiad.wmnet
[08:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:07] !log T303027 depool mw130[2-6]. Old jobrunners/videoscalers, being decommisioned
[08:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:10] T303027: decommission mw130[2-6].eqiad.wmnet - https://phabricator.wikimedia.org/T303027
[08:44:15] (CR) Muehlenhoff: envoy-hot-restart: Switch shebang to /usr/bin/python3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/767536 (owner: Muehlenhoff)
[08:44:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P21808 and previous config saved to /var/cache/conftool/dbconfig/20220304-084427-ladsgroup.json
[08:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:25] (CR) Alexandros Kosiaris: [C: +2] scap: Switch mw1306 to mw1318 for scap proxy role [puppet] - https://gerrit.wikimedia.org/r/767787 (https://phabricator.wikimedia.org/T273915) (owner: Alexandros Kosiaris)
[08:50:17] (CR) Filippo Giunchedi: [C: +1] "LGTM, modulo what Keith said" [puppet] - https://gerrit.wikimedia.org/r/767836 (owner: Cwhite)
[08:55:05] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:56:26] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:33] (PS3) Gehel: [wdqs] switch wdqs1010 to the streaming updater [puppet] - https://gerrit.wikimedia.org/r/742670 (https://phabricator.wikimedia.org/T301108) (owner: DCausse)
[08:58:53] (CR) jerkins-bot: [V: -1] [wdqs] switch
wdqs1010 to the streaming updater [puppet] - https://gerrit.wikimedia.org/r/742670 (https://phabricator.wikimedia.org/T301108) (owner: DCausse)
[08:59:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T300992)', diff saved to https://phabricator.wikimedia.org/P21809 and previous config saved to /var/cache/conftool/dbconfig/20220304-085932-ladsgroup.json
[08:59:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:59:35] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[08:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T300992)', diff saved to https://phabricator.wikimedia.org/P21810 and previous config saved to /var/cache/conftool/dbconfig/20220304-085939-ladsgroup.json
[08:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T300992)', diff saved to https://phabricator.wikimedia.org/P21811 and previous config saved to /var/cache/conftool/dbconfig/20220304-090147-ladsgroup.json
[09:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:21] RECOVERY - Check systemd state on db2073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:05] (CR) David Caro: [C: +1] "I confirm that none of these hosts and project do
exist." [puppet] - https://gerrit.wikimedia.org/r/767925 (owner: Majavah)
[09:09:55] SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (MGerlach)
[09:11:00] SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (MGerlach) Hi. we have a new formal collaborator onboard: @Dale_Zhou . They need access to HDFS and stat machines for a new research project. Let me know if you requ...
[09:11:45] (PS1) Vgutierrez: site: Reimage cp5004 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/768011 (https://phabricator.wikimedia.org/T290005)
[09:12:08] SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (MGerlach)
[09:12:40] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rdb[1005-1006].eqiad.wmnet
[09:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:46] (CR) Vgutierrez: [C: +2] site: Reimage cp5004 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/768011 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:14:13] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp5004.eqsin.wmnet with OS buster
[09:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:21] ops-eqiad, decommission-hardware: decommission rdb100[56].eqiad.wmnet - https://phabricator.wikimedia.org/T273139 (akosiaris)
[09:14:27] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5004.eqsin.wmnet with
OS buster
[09:15:33] ops-eqiad, decommission-hardware: decommission rdb100[56].eqiad.wmnet - https://phabricator.wikimedia.org/T273139 (akosiaris) @Jclark-ctr, @Cmjohnson, @wiki_willy. Hosts have finally been decommissioned and are now powered off and ready for the final stage of unracking. Thanks!
[09:16:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P21812 and previous config saved to /var/cache/conftool/dbconfig/20220304-091652-ladsgroup.json
[09:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:23] SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for ShubhankarP - https://phabricator.wikimedia.org/T303032 (MGerlach)
[09:19:41] SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for ShubhankarP - https://phabricator.wikimedia.org/T303032 (MGerlach) Hi. we have a new formal collaborator onboard: @ShubhankarP . They need access to HDFS and stat machines for a new research project. Let me know if you...
[09:20:49] (03PS1) 10Vgutierrez: site: Reimage cp4024 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768013 (https://phabricator.wikimedia.org/T290005) [09:20:50] everything *.wikimedia.org seems to be responding quite slow for me, not sure if this is a me issue or something else [09:21:17] can't even get grafana to load to see if something is going on [09:21:23] PROBLEM - LVS eventgate-analytics-external eqiad port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.eqiad.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:21:26] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [09:21:37] oh? that could be related? [09:21:50] yo [09:21:56] here too [09:22:06] can confirm things are slow/non-working [09:22:13] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (10Aklapper) @herron: Do you agree? 
^ [09:22:14] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp4024 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768013 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:22:25] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:22:25] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:22:27] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:22:27] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:22:29] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:22:31] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for ShubhankarP - https://phabricator.wikimedia.org/T303032 (10MGerlach) [09:22:41] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:22:56] sigh [09:22:57] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:22:59] RECOVERY - LVS eventgate-analytics-external eqiad port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.eqiad.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.eqiad.wmnet is OK: OK - Certificate eventgate-analytics-external.discovery.wmnet will expire on Tue 04 Mar 2025 03:18:34 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:23:05] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:23:05] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:23:05] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:23:13] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:23:19] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:23:37] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:23:47] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:23:47] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:24:25] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10MoritzMuehlenhoff) Retroactively finding all contributors to the repository at once is a task which will be humongous and full of obstacles (think... 
[09:24:29] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3062.esams.wmnet are marked down but pooled: testlb6_443: Servers cp3050.esams.wmnet, cp3058.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:24:29] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:25:30] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:25:41] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:26:05] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:26:11] around [09:26:27] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:26:27] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:26:39] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:26:50] esams going weee? 
[09:27:13] some discussion on #mediawiki_security [09:28:45] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:28:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [09:29:05] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:29:27] PROBLEM - LVS eventgate-analytics-external eqiad port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.eqiad.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:30:17] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:30:19] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:30:19] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:30:41] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:30:49] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:30:51] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Varnish [09:31:01] RECOVERY - LVS eventgate-analytics-external eqiad port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.eqiad.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.eqiad.wmnet is OK: OK - Certificate eventgate-analytics-external.discovery.wmnet will expire on Tue 04 Mar 2025 03:18:34 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:31:09] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:31:21] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [09:31:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P21813 and previous config saved to /var/cache/conftool/dbconfig/20220304-093157-ladsgroup.json [09:31:58] (03PS1) 10Ladsgroup: Depool esams [dns] - 10https://gerrit.wikimedia.org/r/767817 [09:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:03] (03PS2) 10Ladsgroup: Depool esams [dns] - 10https://gerrit.wikimedia.org/r/767817 [09:32:03] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:32:55] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:32:56] (03PS1) 10Ladsgroup: Depool esams via esams-offline map [dns] - 10https://gerrit.wikimedia.org/r/767818 [09:32:59] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:33:01] (03PS2) 10Ladsgroup: Depool esams via esams-offline map [dns] - 
10https://gerrit.wikimedia.org/r/767818 [09:33:05] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:33:15] (03CR) 10Muehlenhoff: [C: 03+1] Depool esams [dns] - 10https://gerrit.wikimedia.org/r/767817 (owner: 10Ladsgroup) [09:33:16] !log restart varnish on cp3060 [09:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:29] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:33:37] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:33:37] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:33:48] (03CR) 10Ayounsi: [C: 03+1] Depool esams via esams-offline map [dns] - 10https://gerrit.wikimedia.org/r/767818 (owner: 10Ladsgroup) [09:34:00] (03CR) 10Ladsgroup: [C: 03+2] Depool esams via esams-offline map [dns] - 10https://gerrit.wikimedia.org/r/767818 (owner: 10Ladsgroup) [09:34:19] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:34:59] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:36:01] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.563 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:36:05] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:36:05] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 473 
bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:36:05] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:36:05] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:36:05] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:36:05] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:36:27] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:36:43] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 8.036 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:36:53] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:37:30] !log restart varnish on cp3058 [09:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:43] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 6.468 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:37:57] PROBLEM - LVS eventgate-analytics-external eqiad port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.eqiad.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:38:09] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 5.528 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:38:11] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 8.832 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:38:17] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:38:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5004.eqsin.wmnet with reason: host reimage [09:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:39] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.795 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:38:43] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:38:45] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:39:23] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:39:23] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:39:23] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:39:23] 
RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:39:23] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:39:23] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:39:23] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:39:24] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:39:35] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [09:40:21] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:40:21] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:40:53] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:40:55] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:40:59] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:41:35] some pages (enwiki) 
take 30-150 sec to appear, I've got headers if anyone needs anything [09:41:51] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics-external_4692: Servers kubernetes1013.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled htt [09:41:51] itech.wikimedia.org/wiki/PyBal [09:41:51] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5004.eqsin.wmnet with reason: host reimage [09:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:13] !log restart varnish on cp3056 [09:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:40] oh, it is 3056. [09:43:59] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics-external_4692: Servers kubernetes1003.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/ [09:43:59] al [09:44:01] grin: we are aware of the issue and working on that [09:44:18] nice. thanks. then I destroy my evidence. 
;) [09:46:35] (03PS1) 10JMeybohm: admin: add sthart to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/768015 (https://phabricator.wikimedia.org/T302929) [09:46:37] (03PS1) 10JMeybohm: admin: Add otich to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/768016 (https://phabricator.wikimedia.org/T302986) [09:47:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T300992)', diff saved to https://phabricator.wikimedia.org/P21814 and previous config saved to /var/cache/conftool/dbconfig/20220304-094702-ladsgroup.json [09:47:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance [09:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance [09:47:06] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [09:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T300992)', diff saved to https://phabricator.wikimedia.org/P21815 and previous config saved to /var/cache/conftool/dbconfig/20220304-094710-ladsgroup.json [09:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:25] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:47:26] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:47:26] RECOVERY - Varnish HTTP text-frontend - 
port 3127 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:48:43] (03CR) 10Muehlenhoff: [C: 03+1] admin: add sthart to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/768015 (https://phabricator.wikimedia.org/T302929) (owner: 10JMeybohm) [09:49:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T300992)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20220304-094918-ladsgroup.json [09:50:33] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:50:55] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [09:51:42] (03PS1) 10Ladsgroup: Revert "Depool esams via esams-offline map" [dns] - 10https://gerrit.wikimedia.org/r/767819 [09:52:09] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:52:53] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:53:14] (03CR) 10Vgutierrez: [C: 03+1] Revert "Depool esams via esams-offline map" [dns] - 10https://gerrit.wikimedia.org/r/767819 (owner: 10Ladsgroup) [09:53:27] (03CR) 10Ladsgroup: [C: 03+2] Revert "Depool esams via esams-offline map" [dns] - 10https://gerrit.wikimedia.org/r/767819 (owner: 10Ladsgroup) [09:53:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [10:02:09] (03PS1) 10Ayounsi: Blackhole intake-analytics 
[dns] - 10https://gerrit.wikimedia.org/r/768020 [10:03:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [10:04:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21816 and previous config saved to /var/cache/conftool/dbconfig/20220304-100427-ladsgroup.json [10:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:29] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@1c8384f]: AF //tion default args [10:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:37] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@1c8384f]: AF //tion default args (duration: 00m 07s) [10:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:34] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767882 (owner: 10Jbond) [10:10:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:37] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:11:51] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:13:13] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 39.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:13:19] RECOVERY - LVS eventgate-analytics-external eqiad 
port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.eqiad.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.eqiad.wmnet is OK: OK - Certificate eventgate-analytics-external.discovery.wmnet will expire on Tue 04 Mar 2025 03:18:34 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:13:35] (03PS1) 10Vgutierrez: vcl: block intake-analytics.wm.o traffic [puppet] - 10https://gerrit.wikimedia.org/r/768021 [10:14:30] (03PS2) 10Vgutierrez: vcl: block intake-analytics.wm.o traffic [puppet] - 10https://gerrit.wikimedia.org/r/768021 [10:15:30] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:16:10] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [10:16:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:47] (03PS1) 10Jbond: varnish: text front end [puppet] - 10https://gerrit.wikimedia.org/r/768022 [10:18:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [10:19:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21817 and previous config saved to /var/cache/conftool/dbconfig/20220304-101932-ladsgroup.json [10:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:10] (03Abandoned) 10Jbond: varnish: text front end [puppet] - 10https://gerrit.wikimedia.org/r/768022 (owner: 10Jbond) [10:20:38] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org [10:24:42] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5004.eqsin.wmnet with OS buster [10:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:54] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host 
cp5004.eqsin.wmnet with OS buster c... [10:25:00] (03PS1) 10Elukey: Increase replicas for eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/768024 [10:26:47] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/768024 (owner: 10Elukey) [10:27:22] (03CR) 10JMeybohm: [C: 03+1] Increase replicas for eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/768024 (owner: 10Elukey) [10:28:12] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:29:28] !log pool cp5004 with HAProxy as TLS termination layer - T290005 [10:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:33] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:30:01] (03CR) 10Elukey: [C: 03+2] Increase replicas for eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/768024 (owner: 10Elukey) [10:32:36] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:34:27] (03CR) 10Jbond: vcl: block intake-analytics.wm.o traffic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768021 (owner: 10Vgutierrez) [10:34:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T300992)', diff saved to https://phabricator.wikimedia.org/P21818 and previous config saved to /var/cache/conftool/dbconfig/20220304-103437-ladsgroup.json [10:34:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [10:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [10:34:40] T300992: Add 
linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [10:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T300992)', diff saved to https://phabricator.wikimedia.org/P21819 and previous config saved to /var/cache/conftool/dbconfig/20220304-103444-ladsgroup.json [10:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:06] (03Abandoned) 10Ayounsi: Blackhole intake-analytics [dns] - 10https://gerrit.wikimedia.org/r/768020 (owner: 10Ayounsi) [10:35:21] jbond: you're right BTW :) [10:35:29] (03Abandoned) 10Vgutierrez: vcl: block intake-analytics.wm.o traffic [puppet] - 10https://gerrit.wikimedia.org/r/768021 (owner: 10Vgutierrez) [10:36:27] vgutierrez: ahh cool thanks, I'm never 100% sure, but after looking I'm guessing everything in alternate-domains.inc.vcl is misc, right?
[10:36:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T300992)', diff saved to https://phabricator.wikimedia.org/P21820 and previous config saved to /var/cache/conftool/dbconfig/20220304-103652-ladsgroup.json [10:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:03] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4024.ulsfo.wmnet with OS buster [10:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:15] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4024.ulsfo.wmnet with OS buster [10:38:28] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [10:47:11] (03CR) 10Volans: [C: 04-1] "I'm getting 403 forbidden. Apart that LGTM, just couple of nits inline." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [10:47:31] (03PS1) 10Vgutierrez: site: Reimage cp3059 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768027 (https://phabricator.wikimedia.org/T290005) [10:48:44] (03PS1) 10Jbond: varnish: rate limit http://intake-analytics.wm.o/ [puppet] - 10https://gerrit.wikimedia.org/r/768028 [10:49:00] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp3059 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768027 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:49:32] (03CR) 10Muehlenhoff: [C: 03+1] admin: Add otich to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/768016 (https://phabricator.wikimedia.org/T302986) (owner: 10JMeybohm) [10:50:15] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3059.esams.wmnet with OS buster [10:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:29] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3059.esams.wmnet with OS buster [10:51:44] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:51:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P21821 and previous config saved to /var/cache/conftool/dbconfig/20220304-105157-ladsgroup.json [10:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:49] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4024.ulsfo.wmnet with reason: host reimage [10:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:54] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:56:12] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4024.ulsfo.wmnet with reason: host reimage [10:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:01] (03PS2) 10Jbond: varnish: rate limit http://intake-analytics.wm.o/ [puppet] - 10https://gerrit.wikimedia.org/r/768028 [10:58:54] (03CR) 10Jbond: "otto are you able to advice on what would make a reasonable rate limit for intake-analytics" [puppet] - 10https://gerrit.wikimedia.org/r/768028 (owner: 10Jbond) [11:04:21] (03PS5) 10Jbond: O:idp_test: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767779 [11:04:49] (03CR) 10Jbond: O:idp_test: update same site policy and disale pin to session (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767779 (owner: 10Jbond) [11:04:52] (03PS16) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [11:05:11] (03CR) 10jerkins-bot: [V: 04-1] Add helm charts and a helmfile configuration for datahub [deployment-charts] - 
10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [11:05:15] (03PS5) 10Jbond: O:idp: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767780 [11:06:11] (03CR) 10Jbond: O:idp: update same site policy and disale pin to session (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767780 (owner: 10Jbond) [11:06:21] (03PS6) 10Jbond: O:idp: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767780 [11:06:30] (03CR) 10Jbond: [C: 03+2] C:apereo_cas: make session configurable [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [11:07:01] (03PS17) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [11:07:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P21822 and previous config saved to /var/cache/conftool/dbconfig/20220304-110702-ladsgroup.json [11:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:19] vgutierrez: you happy for me to merge 8832fd263c [11:07:27] (03CR) 10jerkins-bot: [V: 04-1] Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [11:07:46] jbond: yes go ahead please [11:08:01] vgutierrez: merged [11:08:06] thx [11:08:09] np [11:08:26] (03PS18) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [11:08:59] (03CR) 10jerkins-bot: [V: 04-1] Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 
10Btullis) [11:09:21] !log pool cp4024 with HAProxy as TLS termination layer - T290005 [11:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:24] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:10:11] (03Abandoned) 10Jbond: O:reposync: document the need for KEYHOLDER_SOCK [puppet] - 10https://gerrit.wikimedia.org/r/764755 (owner: 10Jbond) [11:10:24] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:netbox::automation: Add reposync with netbox-hiera bare repo [puppet] - 10https://gerrit.wikimedia.org/r/767882 (owner: 10Jbond) [11:14:44] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4024.ulsfo.wmnet with OS buster [11:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:56] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4024.ulsfo.wmnet with OS buster c... 
[11:16:26] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Data-Engineering, 10Event-Platform, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (10jcrespo) [11:18:32] (03CR) 10JMeybohm: [C: 03+2] admin: add sthart to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/768015 (https://phabricator.wikimedia.org/T302929) (owner: 10JMeybohm) [11:18:34] (03CR) 10JMeybohm: [C: 03+2] admin: Add otich to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/768016 (https://phabricator.wikimedia.org/T302986) (owner: 10JMeybohm) [11:18:40] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3059.esams.wmnet with reason: host reimage [11:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T300992)', diff saved to https://phabricator.wikimedia.org/P21823 and previous config saved to /var/cache/conftool/dbconfig/20220304-112207-ladsgroup.json [11:22:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [11:22:09] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3059.esams.wmnet with reason: host reimage [11:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [11:22:10] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [11:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:15] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T300992)', diff saved to https://phabricator.wikimedia.org/P21824 and previous config saved to /var/cache/conftool/dbconfig/20220304-112214-ladsgroup.json [11:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T300992)', diff saved to https://phabricator.wikimedia.org/P21825 and previous config saved to /var/cache/conftool/dbconfig/20220304-112422-ladsgroup.json [11:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:38] (03PS1) 10Hoo man: Wikibase dumps: Lower batch size (reduce run time) [puppet] - 10https://gerrit.wikimedia.org/r/768032 (https://phabricator.wikimedia.org/T300255) [11:29:05] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:22] (03PS2) 10Hoo man: Wikibase dumps: Lower batch size (reduce run time) [puppet] - 10https://gerrit.wikimedia.org/r/768032 (https://phabricator.wikimedia.org/T300255) [11:39:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P21826 and previous config saved to /var/cache/conftool/dbconfig/20220304-113927-ladsgroup.json [11:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767779 (owner: 10Jbond) [11:43:00] (03CR) 10Muehlenhoff: O:idp: update same site policy and disale pin to session (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767780 (owner: 10Jbond) [11:44:24] 10SRE, 10DNS, 10Traffic, 10Wikimedia Enterprise: 301 redirect setup for wikimediaenterprise - https://phabricator.wikimedia.org/T302756 (10Protsack.stephan) 
p:05Triage→03Low [11:54:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P21827 and previous config saved to /var/cache/conftool/dbconfig/20220304-115432-ladsgroup.json [11:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:48] (03PS6) 10Jbond: O:idp_test: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767779 [12:01:08] (03PS7) 10Jbond: O:idp: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767780 [12:04:08] (03CR) 10Jbond: [C: 03+2] O:idp_test: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767779 (owner: 10Jbond) [12:04:11] (03CR) 10Jbond: [C: 03+2] O:idp: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767780 (owner: 10Jbond) [12:04:51] !log enable SameSite=Strict on idp [12:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T300992)', diff saved to https://phabricator.wikimedia.org/P21828 and previous config saved to /var/cache/conftool/dbconfig/20220304-120937-ladsgroup.json [12:09:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [12:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [12:09:40] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:09:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T300992)', diff saved to https://phabricator.wikimedia.org/P21829 and previous config saved to /var/cache/conftool/dbconfig/20220304-120944-ladsgroup.json [12:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T300992)', diff saved to https://phabricator.wikimedia.org/P21830 and previous config saved to /var/cache/conftool/dbconfig/20220304-121152-ladsgroup.json [12:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:53] 10SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (10SCherukuwada) Just had a discussion with @jcrespo. To understand what each of these webmaster consoles provides and what ACLs and such they support (so that we can have a process around giving access... [12:14:49] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10Ladsgroup) Yes. With this time period, there might be even volunteers who have passed away. This is going to be next to impossible. I'm happy wit... 
[12:18:15] 10SRE, 10observability: grafana-ldap-users-sync failing with Grafana 8 - https://phabricator.wikimedia.org/T303041 (10MoritzMuehlenhoff) [12:26:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P21831 and previous config saved to /var/cache/conftool/dbconfig/20220304-122656-ladsgroup.json [12:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:19] (03PS7) 10Jelto: gitlab_runner: execute gitlab-runner as non-root [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) [12:35:58] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34084/console" [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [12:37:39] (03CR) 10Tchanders: "According to Niharika, we're good to go ahead with this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: 10Tchanders) [12:39:59] (03CR) 10Jelto: [V: 03+1] gitlab_runner: execute gitlab-runner as non-root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [12:42:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P21832 and previous config saved to /var/cache/conftool/dbconfig/20220304-124201-ladsgroup.json [12:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:39] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:21] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [12:50:03] PROBLEM - Check unit 
status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:50:24] (03CR) 10Muehlenhoff: Require Python 3.7/buster for logout scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767064 (owner: 10Muehlenhoff) [12:52:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [12:54:23] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:57:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T300992)', diff saved to https://phabricator.wikimedia.org/P21833 and previous config saved to /var/cache/conftool/dbconfig/20220304-125706-ladsgroup.json [12:57:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [12:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [12:57:10] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T300992)', diff saved to https://phabricator.wikimedia.org/P21834 and previous config saved to /var/cache/conftool/dbconfig/20220304-125714-ladsgroup.json [12:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T300992)', diff saved to 
https://phabricator.wikimedia.org/P21835 and previous config saved to /var/cache/conftool/dbconfig/20220304-125921-ladsgroup.json [12:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:21] (03PS2) 10Cathal Mooney: Add several ASNs to those that alert as critical from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/767862 (https://phabricator.wikimedia.org/T299758) [13:00:51] (03PS3) 10Alexandros Kosiaris: mw130[2-6]: Remove and decomission [puppet] - 10https://gerrit.wikimedia.org/r/767788 (https://phabricator.wikimedia.org/T273915) [13:01:15] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [13:02:41] (03CR) 10jerkins-bot: [V: 04-1] mw130[2-6]: Remove and decomission [puppet] - 10https://gerrit.wikimedia.org/r/767788 (https://phabricator.wikimedia.org/T273915) (owner: 10Alexandros Kosiaris) [13:03:35] (03CR) 10Cathal Mooney: Add several ASNs to those that alert as critical from Icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767862 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [13:05:37] 10SRE, 10ops-eqsin, 10Traffic: SMART error (CurrentPendingSector) detected on host: cp5004 - https://phabricator.wikimedia.org/T303043 (10JMeybohm) [13:07:21] (03PS1) 10SCherukuwada: Added DNS verification records for Bing and Yandex to prove our ownership of wikipedia.org. This is a required step for obtaining access to their respective Webmaster tools. This has already been done for Google's Search Console. [dns] - 10https://gerrit.wikimedia.org/r/768037 [13:09:07] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/767788 (https://phabricator.wikimedia.org/T273915) (owner: 10Alexandros Kosiaris) [13:10:14] (03PS2) 10SCherukuwada: Added DNS verification records for Bing and Yandex to prove our ownership of wikipedia.org. This is a required step for obtaining access to their respective Webmaster tools. 
This has already been done for Google's Search Console. [dns] - 10https://gerrit.wikimedia.org/r/768037 [13:10:35] (03PS1) 10JMeybohm: Enable tls proxy telemetry by default in eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/768038 (https://phabricator.wikimedia.org/T303042) [13:11:17] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:12:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] mw130[2-6]: Remove and decomission [puppet] - 10https://gerrit.wikimedia.org/r/767788 (https://phabricator.wikimedia.org/T273915) (owner: 10Alexandros Kosiaris) [13:13:23] (03CR) 10JMeybohm: [C: 03+1] envoy-hot-restart: Switch shebang to /usr/bin/python3 [puppet] - 10https://gerrit.wikimedia.org/r/767536 (owner: 10Muehlenhoff) [13:13:44] (03PS1) 10Filippo Giunchedi: hiera: allow access to am api to cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/768039 (https://phabricator.wikimedia.org/T291946) [13:14:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P21836 and previous config saved to /var/cache/conftool/dbconfig/20220304-131426-ladsgroup.json [13:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:09] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: execute gitlab-runner as non-root [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:17:48] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34085/console" [puppet] - 10https://gerrit.wikimedia.org/r/768039 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:19:02] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1302-1306].eqiad.wmnet [13:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:25] 10SRE, 10SRE-Access-Requests: 
Requesting access to analytics-privatedata-users and researchers for ShubhankarP - https://phabricator.wikimedia.org/T303032 (10JMeybohm) * `researchers` group is marked as deprecated (T268801) * /cc @Ottomata || @odimitrijevic for `analytics-privatedata-users` approval [13:21:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (10JMeybohm) * `researchers` group is marked as deprecated (T268801) * /cc @Ottomata || @odimitrijevic for `analytics-privatedata-users` approval [13:23:44] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hiera: allow access to am api to cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/768039 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:24:45] PROBLEM - Check systemd state on cp3059 is CRITICAL: CRITICAL - degraded: The following units failed: varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:09] (03PS1) 10Jelto: gitlab_runner: fix service definition [puppet] - 10https://gerrit.wikimedia.org/r/768040 (https://phabricator.wikimedia.org/T295481) [13:29:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P21837 and previous config saved to /var/cache/conftool/dbconfig/20220304-132931-ladsgroup.json [13:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:59] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:32:38] (03PS3) 10SCherukuwada: Added DNS verification records for Bing and Yandex Webmaster tools. [dns] - 10https://gerrit.wikimedia.org/r/768037 [13:32:48] (03PS4) 10SCherukuwada: Add DNS verification records for Bing and Yandex Webmaster tools. 
[dns] - 10https://gerrit.wikimedia.org/r/768037 [13:33:33] (03PS5) 10SCherukuwada: Add DNS verification records for Bing and Yandex Webmaster tools. [dns] - 10https://gerrit.wikimedia.org/r/768037 [13:35:57] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:13] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:37:54] (03PS1) 10ArielGlenn: Handle exceptions from getting web requests properly [puppet] - 10https://gerrit.wikimedia.org/r/768045 (https://phabricator.wikimedia.org/T302930) [13:38:16] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Wikibase dumps: Lower batch size (reduce run time) [puppet] - 10https://gerrit.wikimedia.org/r/768032 (https://phabricator.wikimedia.org/T300255) (owner: 10Hoo man) [13:38:56] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [13:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:07] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.58 ms [13:39:23] (03PS2) 10Jelto: gitlab_runner: fix service definition [puppet] - 10https://gerrit.wikimedia.org/r/768040 (https://phabricator.wikimedia.org/T295481) [13:41:34] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34087/console" [puppet] - 10https://gerrit.wikimedia.org/r/768040 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:42:09] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: fix service definition [puppet] - 10https://gerrit.wikimedia.org/r/768040 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:44:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T300992)', diff saved to 
https://phabricator.wikimedia.org/P21838 and previous config saved to /var/cache/conftool/dbconfig/20220304-134436-ladsgroup.json [13:44:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [13:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [13:44:40] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [13:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T300992)', diff saved to https://phabricator.wikimedia.org/P21839 and previous config saved to /var/cache/conftool/dbconfig/20220304-134443-ladsgroup.json [13:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:08] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T300992)', diff saved to https://phabricator.wikimedia.org/P21840 and previous config saved to /var/cache/conftool/dbconfig/20220304-134651-ladsgroup.json [13:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for ShubhankarP - https://phabricator.wikimedia.org/T303032 (10Ottomata) Approved. We should have an expiry date for this account as well. 
[13:48:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (10Ottomata) Approved. We should have an expiry date for this as well. [13:49:05] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:49:18] (03PS12) 10Filippo Giunchedi: Introduce 'alertmanager' and 'alerting' modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) [13:49:24] (03CR) 10Filippo Giunchedi: Introduce 'alertmanager' and 'alerting' modules (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [13:49:38] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1302-1306].eqiad.wmnet [13:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:30] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:51:29] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission mw130[2-6].eqiad.wmnet - https://phabricator.wikimedia.org/T303027 (10akosiaris) a:05akosiaris→03None @wiki_willy, @Cmjohnson @Jclark-ctr hosts decommissioned and ready to be unracked. Thanks! [13:52:06] (03CR) 10Ottomata: "I guess this is per varnish node, yes?" 
[puppet] - 10https://gerrit.wikimedia.org/r/768028 (owner: 10Jbond) [13:56:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10JMeybohm) [14:01:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P21841 and previous config saved to /var/cache/conftool/dbconfig/20220304-140156-ladsgroup.json [14:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:36] (03CR) 10Ottomata: "We should do this for intake-logging.wm.o too, it has a similar intake pipeline as intake-analytics." [puppet] - 10https://gerrit.wikimedia.org/r/768028 (owner: 10Jbond) [14:17:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P21842 and previous config saved to /var/cache/conftool/dbconfig/20220304-141701-ladsgroup.json [14:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:42] (03PS6) 10Jcrespo: Add DNS verification records for Bing and Yandex Webmaster tools [dns] - 10https://gerrit.wikimedia.org/r/768037 (https://phabricator.wikimedia.org/T302617) (owner: 10SCherukuwada) [14:18:29] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [14:20:56] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org [14:25:25] (03CR) 10Jcrespo: "I may merge this next week, as I may not be around to shepard the change later today (and we had lots of excitement already this morning, " [dns] - 10https://gerrit.wikimedia.org/r/768037 (https://phabricator.wikimedia.org/T302617) (owner: 
10SCherukuwada) [14:27:30] (03CR) 10Jcrespo: "Forgot to mention, for context, that this will be applied to 7 other public domains, but the idea is to test functionality (first, on a si" [dns] - 10https://gerrit.wikimedia.org/r/768037 (https://phabricator.wikimedia.org/T302617) (owner: 10SCherukuwada) [14:30:05] (03CR) 10SCherukuwada: Add DNS verification records for Bing and Yandex Webmaster tools (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/768037 (https://phabricator.wikimedia.org/T302617) (owner: 10SCherukuwada) [14:31:47] (03PS1) 10Gerrit maintenance bot: db1144: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/768054 (https://phabricator.wikimedia.org/T302950) [14:32:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T300992)', diff saved to https://phabricator.wikimedia.org/P21844 and previous config saved to /var/cache/conftool/dbconfig/20220304-143206-ladsgroup.json [14:32:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [14:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [14:32:11] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [14:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T300992)', diff saved to https://phabricator.wikimedia.org/P21845 and previous config saved to /var/cache/conftool/dbconfig/20220304-143214-ladsgroup.json [14:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1172 (T300992)', diff saved to https://phabricator.wikimedia.org/P21846 and previous config saved to /var/cache/conftool/dbconfig/20220304-143421-ladsgroup.json [14:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:13] !log pool cp3059 with HAProxy as TLS termination layer - T290005 [14:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:21] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:43:49] RECOVERY - Check systemd state on cp3059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3059.esams.wmnet with OS buster [14:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:52] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3059.esams.wmnet with OS buster c... 
[14:49:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P21847 and previous config saved to /var/cache/conftool/dbconfig/20220304-144926-ladsgroup.json [14:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:03] (03PS1) 10Vgutierrez: prometheus:rules_ops: Provide HAProxy total responses metrics [puppet] - 10https://gerrit.wikimedia.org/r/768056 (https://phabricator.wikimedia.org/T290005) [14:54:20] (03CR) 10jerkins-bot: [V: 04-1] prometheus:rules_ops: Provide HAProxy total responses metrics [puppet] - 10https://gerrit.wikimedia.org/r/768056 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:55:02] 10SRE, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) [14:59:36] !log restart elasticsearch_6@production-search-psi-eqiad.service on elastic1049 to resolve CirrusSearchJVMGCOldPoolFlatlined alert [14:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:59] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Majavah) [15:00:36] 10SRE, 10Data-Catalog, 10Data-Engineering, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) I have created a patch to operations/deployment-charts that I believe will be a good start in enabling this service. https://gerrit.wikimedia.org/r/... 
[15:00:56] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org [15:00:59] 10SRE, 10Data-Catalog, 10Data-Engineering, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) [15:01:05] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:02:14] (03PS2) 10Vgutierrez: prometheus:rules_ops: Provide HAProxy total responses metrics [puppet] - 10https://gerrit.wikimedia.org/r/768056 (https://phabricator.wikimedia.org/T290005) [15:04:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P21848 and previous config saved to /var/cache/conftool/dbconfig/20220304-150433-ladsgroup.json [15:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:24] (03PS1) 10Vgutierrez: prometheus:rules_global: Provide HAProxy availability metrics [puppet] - 10https://gerrit.wikimedia.org/r/768057 [15:08:49] (03PS1) 10Ssingh: aptrepo: add a component for certspotter [puppet] - 10https://gerrit.wikimedia.org/r/768058 [15:10:59] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) [15:11:33] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) How can I tell what the source IP address(es) of my services will be, as seen by the back-end data stores? Will these be predicatabl... 
[15:12:20] (03PS1) 10Vgutierrez: site: Reimage cp2038 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768059 (https://phabricator.wikimedia.org/T290005) [15:15:27] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2038 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768059 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:16:10] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp2038.codfw.wmnet with OS buster [15:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:22] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2038.codfw.wmnet with OS buster [15:18:16] (03PS1) 10Vgutierrez: site: Reimage cp1086 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768062 (https://phabricator.wikimedia.org/T290005) [15:18:40] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) The diagram doesn't cover prometheus support, but it is included. I have added: `prometheus.io/port: 4318` and `prometheus.io/scrap... 
[15:19:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T300992)', diff saved to https://phabricator.wikimedia.org/P21849 and previous config saved to /var/cache/conftool/dbconfig/20220304-151937-ladsgroup.json [15:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:41] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [15:19:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [15:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [15:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [15:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [15:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [15:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [15:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T300992)', diff saved to https://phabricator.wikimedia.org/P21850 and previous config saved to 
/var/cache/conftool/dbconfig/20220304-152007-ladsgroup.json [15:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T300992)', diff saved to https://phabricator.wikimedia.org/P21851 and previous config saved to /var/cache/conftool/dbconfig/20220304-152114-ladsgroup.json [15:21:15] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp1086 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768062 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:05] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp1086.eqiad.wmnet with OS buster [15:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:17] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1086.eqiad.wmnet with OS buster [15:23:51] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus:rules_ops: Provide HAProxy total responses metrics [puppet] - 10https://gerrit.wikimedia.org/r/768056 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:25:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, though please add a new group of rules to thanos too (modules/profile/files/thanos/recording_rules.yaml) as part of the effort to mo" [puppet] - 10https://gerrit.wikimedia.org/r/768057 (owner: 10Vgutierrez) [15:25:31] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Krinkle) [15:28:09] (03PS1) 10Ssingh: 
certspotter: update package and replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/768065 (https://phabricator.wikimedia.org/T204993) [15:30:11] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34088/console" [puppet] - 10https://gerrit.wikimedia.org/r/768065 (https://phabricator.wikimedia.org/T204993) (owner: 10Ssingh) [15:32:30] 10ops-eqiad, 10Cloud-Services, 10DC-Ops: hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10nskaggs) [15:33:25] 10ops-eqiad, 10Cloud-Services, 10DC-Ops: hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10nskaggs) [15:34:28] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2038.codfw.wmnet with reason: host reimage [15:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:44] !log blackhole IPs - T303055 [15:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:59] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10nskaggs) [15:35:03] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10nskaggs) [15:36:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P21852 and previous config saved to /var/cache/conftool/dbconfig/20220304-153619-ladsgroup.json [15:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:49] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cp2038.codfw.wmnet with reason: host reimage [15:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:29] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1086.eqiad.wmnet with reason: host reimage [15:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:52] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10nskaggs) [15:41:54] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1086.eqiad.wmnet with reason: host reimage [15:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10nskaggs) @dcaro I'll note that this cephmon is currently connected to a core switch, and not a cloudsw. It only requi... 
[15:49:46] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@1388c61]: (no justification provided) [15:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:54] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@1388c61]: (no justification provided) (duration: 00m 07s) [15:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:21] !log pool cp2038 with HAProxy as TLS termination layer - T290005 [15:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P21854 and previous config saved to /var/cache/conftool/dbconfig/20220304-155124-ladsgroup.json [15:51:25] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:18] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2038.codfw.wmnet with OS buster [15:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:29] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2038.codfw.wmnet with OS buster c... 
[15:58:04] !log pool cp1086 with HAProxy as TLS termination layer - T290005 [15:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:08] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:59:29] (03CR) 10Lucas Werkmeister (WMDE): Configure `mul` language code on Test Wikidata and its clients (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755453 (https://phabricator.wikimedia.org/T297393) (owner: 10Lucas Werkmeister (WMDE)) [15:59:59] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1086.eqiad.wmnet with OS buster [16:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:12] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1086.eqiad.wmnet with OS buster c... 
[16:00:31] (03CR) 10Ayounsi: [C: 03+1] Add several ASNs to those that alert as critical from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/767862 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [16:00:53] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [16:03:44] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@1388c61]: (no justification provided) [16:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:47] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@1388c61]: (no justification provided) (duration: 00m 03s) [16:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:48] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:06:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T300992)', diff saved to https://phabricator.wikimedia.org/P21856 and previous config saved to /var/cache/conftool/dbconfig/20220304-160629-ladsgroup.json [16:06:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:06:35] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [16:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:37] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [16:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [16:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:42] (03PS1) 10Lucas Werkmeister (WMDE): Write "unexpectedUnconnectedPage" page prop on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768089 [16:09:44] (03PS1) 10Lucas Werkmeister (WMDE): Write "unexpectedUnconnectedPage" page prop on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768090 [16:10:06] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:13:17] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@1388c61]: (no justification provided) [16:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:27] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@1388c61]: (no justification provided) (duration: 00m 10s) [16:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:31] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (10herron) 05Open→03Resolved SGTM! 
[16:31:38] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10herron) [16:35:16] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@1388c61]: (no justification provided) [16:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:24] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@1388c61]: (no justification provided) (duration: 00m 07s) [16:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:35] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:04] (03PS1) 10Herron: wip [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/768108 [17:09:04] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@1388c61]: (no justification provided) [17:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:13] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@1388c61]: (no justification provided) (duration: 00m 08s) [17:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:15] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:12:19] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:27] (03CR) 10Herron: "here's a preview snapshot link for this change https://grafana.wikimedia.org/dashboard/snapshot/xgTaTn2hDhMmUFZi6kf4zSIerSFT170R" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/768108 (https://phabricator.wikimedia.org/T302842) (owner: 10Herron) [17:17:25] PROBLEM - SSH on dns5001.mgmt 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:20:13] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10herron) >>! In T302842#7751951, @RLazarus wrote: > If no objections, I think we should delete the nontemplated dashboard as a hazard to navigation. Sounds good, I've move... [17:22:22] (03PS1) 10Herron: unmanage old static json dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/768110 [17:23:38] (03CR) 10Herron: [V: 03+2 C: 03+2] unmanage old static json dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/768110 (owner: 10Herron) [17:25:04] 10SRE, 10observability: grafana-ldap-users-sync failing with Grafana 8 - https://phabricator.wikimedia.org/T303041 (10herron) [17:27:52] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (10MGerlach) [17:28:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ShubhankarP - https://phabricator.wikimedia.org/T303032 (10MGerlach) [17:28:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ShubhankarP - https://phabricator.wikimedia.org/T303032 (10MGerlach) >>! In T303032#7752939, @JMeybohm wrote: > * `researchers` group is marked as deprecated (T268801) Thanks @JMeybohm, I was not aware of that. I updated the tas... [17:29:22] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (10MGerlach) >>! In T303031#7752943, @JMeybohm wrote: > * `researchers` group is marked as deprecated (T268801) Thanks @JMeybohm, I was not aware of that. I updated the task... 
[17:33:58] (03PS1) 10Jbond: utils: create blame-stats script [puppet] - 10https://gerrit.wikimedia.org/r/768114 (https://phabricator.wikimedia.org/T67270) [17:34:29] (03CR) 10jerkins-bot: [V: 04-1] utils: create blame-stats script [puppet] - 10https://gerrit.wikimedia.org/r/768114 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [17:36:57] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:37:37] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10jbond) See https://gerrit.wikimedia.org/r/768114 i have hacked together a quick script that uses https://github.com/mergestat/mergestat to get sta... [17:39:44] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@19520c1]: (no justification provided) [17:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:53] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@19520c1]: (no justification provided) (duration: 00m 08s) [17:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:14] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@19520c1]: (no justification provided) [17:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:21] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@19520c1]: (no justification provided) (duration: 00m 07s) [17:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:20] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission mw130[2-6].eqiad.wmnet - https://phabricator.wikimedia.org/T303027 (10wiki_willy) a:03Cmjohnson [17:48:03] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [17:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:27] 
10SRE, 10ops-eqiad, 10decommission-hardware: decommission rdb100[56].eqiad.wmnet - https://phabricator.wikimedia.org/T273139 (10wiki_willy) a:03Cmjohnson [17:49:34] (03PS2) 10C. Scott Ananian: Revert "Enable Parsoid API everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763779 (https://phabricator.wikimedia.org/T302081) [17:50:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:57:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:29] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [17:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:33] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:12:59] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:19:11] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:16] 10SRE, 10ops-eqiad, 10DC-Ops: Q3: install 2 new HDD into centrallog1001 - https://phabricator.wikimedia.org/T302437 (10herron) [18:48:14] 10SRE, 10ops-eqiad, 10DC-Ops: Q3: install 2 new HDD into centrallog1001 - https://phabricator.wikimedia.org/T302437 (10herron) Hey @Jclark-ctr, could we schedule an installation window for next week? 
[18:48:58] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (10Dale_Zhou) [18:56:54] 10SRE, 10SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (10dr0ptp4kt) Approved. [19:07:19] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 127 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:10:09] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 24 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:50:35] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:51:23] (03CR) 10STran: Autopromote-once users to the 'ipinfo' group after one edit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767845 (https://phabricator.wikimedia.org/T296184) (owner: 10Tchanders) [20:07:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10wiki_willy) a:03Cmjohnson [20:11:47] 10SRE, 10ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10wiki_willy) Hi @Jclark-ctr - I know you finished running these cables on Monday, so just checking if we're good to resolve this task? 
Thanks, Willy
[20:20:27] SRE, SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (dr0ptp4kt) To answer the question on the creds, no, they don't need to be shared. But delegated access will need to be established. An SRE with access to the creds should be able t...
[20:42:31] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:52:23] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:54:35] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:55:01] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=86%): /tmp 0 MB (0% inode=86%): /var/tmp 0 MB (0% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[20:56:46] (CR) Krinkle: [C: +1] misc: search-grafana-dashboards.js (1 comment) [software] - https://gerrit.wikimedia.org/r/767118 (owner: Filippo Giunchedi)
[20:59:11] (PS9) Herron: prometheus: sketch out proxied prometheus web with IDP [puppet] - https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944)
[21:04:45] (CR) Herron: "this is deployed in rough form on pontoon-prometheus" [puppet] - https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[21:07:21] (PS1) Krinkle: clinic-duty: Use Date.parse() and assert.propContains() [software] - https://gerrit.wikimedia.org/r/768141
[21:09:10] (PS2) Krinkle: clinic-duty: Use Date.parse() and assert.propContains() [software] - https://gerrit.wikimedia.org/r/768141
[21:13:49] (CR) Samtar: [C: -1] "this is going to need more discussion 😊" [mediawiki-config] - https://gerrit.wikimedia.org/r/767912 (https://phabricator.wikimedia.org/T43479) (owner: Samtar)
[21:20:03] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:44] ACKNOWLEDGEMENT - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=86%): /tmp 0 MB (0% inode=86%): /var/tmp 0 MB (0% inode=86%): Btullis Looking at this now. T303083 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[21:28:33] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:29:41] (PS1) Krinkle: clinic-duty: add coverage for work.gcalendarLink() [software] - https://gerrit.wikimedia.org/r/768142
[21:29:57] (PS3) Urbanecm: throttle: Add rule for arwiki Wikigap [mediawiki-config] - https://gerrit.wikimedia.org/r/767885 (https://phabricator.wikimedia.org/T302973)
[21:31:19] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:13] RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[21:50:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[23:25:38] Hello.
[23:25:57] https://en.wikipedia.org/wiki/Special:Log/MZMcBride is erroring for me.
[23:26:06] > Original exception: [68e9ee6e-f530-4cd7-af5c-992a3ae2d02c] 2022-03-04 23:25:49: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"
[23:26:47] Oona: "Query execution was interrupted (max_statement_time exceeded)"
[23:27:09] looks like there is a 30s time limit on that select
[23:27:27] Something seems funky.
[23:27:36] Special:Log shouldn't be that slow.
[23:28:41] Hm, I thought it was a bigger issue again like the upstream connect bug. Thanks for the debug info.
[23:30:25] there are 4 joins and a long list of where clause conditions. I agree that in theory that should be a reasonably quick select. And I guess the `SET STATEMENT max_statement_time=30 FOR ` preamble means that someone wanted to protect against badness.
[23:32:32] Maybe the way we're partitioning by date making it expensive to get 50 results? Idk.
[23:33:00] Lego points out it's also failing to render an error page so it's probably a true bug.
[23:33:39] yeah this is certainly going "Boom!"
[23:35:50] Probably a bad query plan and then some query in the exception handler is failing?
[23:35:55] I wonder if it is change tag things causing problems again? That was just a few weeks ago right?
[23:36:45] Is there a way to query how many internal errors we're getting for Special:Log/* ?
[23:37:19] Yeah, could be done via logstash
[23:37:23] I'm wondering if it's specific to my unusual account history.
[23:37:29] Filing a quick task anyway for now, shrug.
[23:38:14] If it were more widespread in theory it would increase 5xx counts, showing up in monitoring
[23:39:46] bd808: yeah, but if we're thinking about the same thing it affected Recentchanges rather than Log
[23:40:31] But I think it most likely is a join like change_tags messing with the query plan
[23:41:21] for what it's worth I'm only finding 5 errors like that one in the last 15 minutes of logstash data and all of them are for Oona's user
[23:42:01] so it may be related to the specific user's actions
[23:42:18] Yeah, seems reasonable
[23:42:34] select * from logging_userindex join actor on actor_id = log_actor where actor_name = 'MZMcBride' order by log_timestamp desc limit 50\G
[23:42:48] Is super fast. But not many fancy joins. Ah well.
[23:42:52] zooming out to 24 hours and still only 6 hits
[23:43:03] Truly a special case. ;-)
[23:43:19] you are always one in 7 billion Oona
[23:44:02] The oldest now that we've ever been and the youngest we'll ever be.
[23:48:30] Oona, legoktm: https://phabricator.wikimedia.org/P21857 is the sql query that is timing out.
[23:50:11] Oona: breaking things I see?
[23:52:34] hrm
[23:52:58] i filed something yesterday, then concluded it had been a false alarm, one sec while i dig that up
[23:53:37] T303010
[23:53:38] T303010: Wikimedia\Rdbms\DBQueryError: Error 1969: Query execution was interrupted (max_statement_time exceeded) (db1096:3316) Function: [function] - https://phabricator.wikimedia.org/T303010
[23:53:45] TheresNoTime: Suffering the software, another day as usual.
[23:55:11] (CR) Cwhite: [C: +1] misc: search-grafana-dashboards.js (1 comment) [software] - https://gerrit.wikimedia.org/r/767118 (owner: Filippo Giunchedi)
[23:55:12] brennen: Hrm.
[23:55:41] Any idea when max_statement_time=30 was added?
[23:57:08] A while ago
[23:59:10] T297708 maybe?
[23:59:11] T297708: Set max execution time for several expensive mediawiki actions - https://phabricator.wikimedia.org/T297708
[23:59:40] SRE, observability: grafana-ldap-users-sync failing with Grafana 8 - https://phabricator.wikimedia.org/T303041 (colewhite)
[23:59:57] https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/IPJNO75HYAQWIGTHI5LJHTDVLVOC4LJP/