[00:07:30] !log mwscript extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --wiki=testwiki --key=120p.vp9.webm # T312153 [00:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:34] T312153: Batch run of TMH requeueTranscodes to remove now-unused 120p and 180p low-res files - https://phabricator.wikimedia.org/T312153 [00:08:28] !log mwscript extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --wiki=testwiki --key=180p.vp9.webm # T312153 [00:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:24] (03CR) 10Jdlrobson: [C: 03+1] Revert gallery changes in 1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880920 (https://phabricator.wikimedia.org/T326990) (owner: 10Bartosz Dziewoński) [00:11:33] (03CR) 10Jdlrobson: [C: 03+1] Revert gallery changes in 1.40.0-wmf.18 & .19 [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880921 (https://phabricator.wikimedia.org/T326990) (owner: 10Bartosz Dziewoński) [00:15:54] (03PS1) 10Zabe: Add script to rename a change tag in wmf prod [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/881030 (https://phabricator.wikimedia.org/T327118) [00:16:06] (03CR) 10Zabe: [C: 03+2] Add script to rename a change tag in wmf prod [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/881030 (https://phabricator.wikimedia.org/T327118) (owner: 10Zabe) [00:16:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/881030 (https://phabricator.wikimedia.org/T327118) (owner: 10Zabe) [00:17:59] (03Merged) 10jenkins-bot: Add script to rename a change tag in wmf prod [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/881030 (https://phabricator.wikimedia.org/T327118) (owner: 10Zabe) [00:18:24] !log 
zabe@deploy1002 Started scap: Backport for [[gerrit:881030|Add script to rename a change tag in wmf prod (T327118)]] [00:18:28] T327118: enwiki: Please rename the "discretionary sanctions alert" tag to "contentious topics alert" - https://phabricator.wikimedia.org/T327118 [00:20:10] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:881030|Add script to rename a change tag in wmf prod (T327118)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [00:23:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:26:54] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:881030|Add script to rename a change tag in wmf prod (T327118)]] (duration: 08m 29s) [00:26:58] T327118: enwiki: Please rename the "discretionary sanctions alert" tag to "contentious topics alert" - https://phabricator.wikimedia.org/T327118 [00:28:49] !log enwiki: rename the "discretionary sanctions alert" tag to "contentious topics alert" # T327118 [00:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:12] PROBLEM - SSH on cp2031 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:52:44] RECOVERY - SSH on cp2031 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:55:34] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [00:55:34] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection 
refused https://wikitech.wikimedia.org/wiki/HTTPS
[00:55:38] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[00:56:16] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[00:56:16] SRE, SRE-OnFire, ops-codfw, Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (Papaul) Netbox up to date with the new switch information https://netbox.wikimedia.org/dcim/devices/1883/
[00:57:20] hmm cp2031 doesn't seem to be in a good state, depooling
[00:59:51] (CR) Cwhite: [C: -1] Add PTR resolution to firewall logs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/880889 (https://phabricator.wikimedia.org/T327095) (owner: Ayounsi)
[01:00:18] (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:00:36] PROBLEM - SSH on cp2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:00:50] I ACKed the page
[01:01:13] surely the page isn't just from cp2031?
[01:01:16] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T327015 (Papaul) Open→Resolved a:Papaul This is all good after the fpc2 has been replaced
[01:01:18] I think it is
[01:01:18] (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:01:31] depooling, I was going to attempt a reboot via IPMI but was checking 2032
[01:01:34] doing now
[01:01:48] I wish the AM probes were more readily transparent about exactly what they're checking how, and where
[01:01:52] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[01:02:00] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2031.codfw.wmnet,service=cdn
[01:02:00] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2031.codfw.wmnet,service=ats-be
[01:02:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp2031.codfw.wmnet with reason: downtimed, host unreachable
[01:03:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp2031.codfw.wmnet with reason: downtimed, host unreachable
[01:03:12] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp2031 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 557807 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS
[01:03:12] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp2031 is OK: SSL OK - OCSP staple validity for wikipedia.org has
248207 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-03-25 08:06:40 +0000 (expires in 66 days) https://wikitech.wikimedia.org/wiki/HTTPS [01:03:14] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2031 is OK: HTTP OK: HTTP/1.0 200 OK - 36273 bytes in 0.394 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [01:03:16] RECOVERY - HAProxy HTTPS wikiworkshop.org RSA on cp2031 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 309404 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (RSA) valid until 2023-03-30 14:08:36 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [01:03:28] RECOVERY - SSH on cp2031 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:03:52] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2031 is OK: SSL OK - OCSP staple validity for wikipedia.org has 456967 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-03-25 08:08:42 +0000 (expires in 66 days) https://wikitech.wikimedia.org/wiki/HTTPS [01:04:14] I suspect maybe varnish specifically needed a restart (among possibly other problems on 2031 that monitoring wasn't great at catching). Either way the reboot should patch it up. 
[01:04:46] yeah this isn't the first time that we have purged issues when there were network failures :)
[01:05:18] (ProbeDown) resolved: (2) Service text:80 has failed probes (http_text_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:05:33] L3 is getting to be a bit of a leaky abstraction :)
[01:06:15] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2031.codfw.wmnet
[01:06:18] (ProbeDown) resolved: (2) Service text:80 has failed probes (http_text_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:08:54] SRE, ops-codfw, DC-Ops: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (Papaul)
[01:09:09] SRE, ops-codfw, DC-Ops: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (Papaul) Open→Resolved complete
[01:13:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2031.codfw.wmnet
[01:13:47] (CR) Cwhite: WIP: add rt_flow grokking (2 comments) [puppet] - https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: Filippo Giunchedi)
[01:14:17] looks clean and green, will repool shortly
[01:23:42] SRE, ops-codfw, Infrastructure-Foundations, netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul)
[01:25:53] SRE, ops-codfw, Infrastructure-Foundations, netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul)
[01:27:05] !log sukhe@puppetmaster1001 conftool action
: set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=cdn [01:27:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-be [01:27:09] (03CR) 10Cwhite: "PCC NOOP: https://puppet-compiler.wmflabs.org/output/879886/39159/" [puppet] - 10https://gerrit.wikimedia.org/r/879886 (owner: 10Cwhite) [01:27:41] (03CR) 10Cwhite: "PCC NOOP: https://puppet-compiler.wmflabs.org/output/879887/39160/" [puppet] - 10https://gerrit.wikimedia.org/r/879887 (owner: 10Cwhite) [01:28:41] (03CR) 10Cwhite: "PCC NOOP: https://puppet-compiler.wmflabs.org/output/879888/39161/" [puppet] - 10https://gerrit.wikimedia.org/r/879888 (owner: 10Cwhite) [01:29:10] (03CR) 10Cwhite: "PCC NOOP: https://puppet-compiler.wmflabs.org/output/879889/39162/" [puppet] - 10https://gerrit.wikimedia.org/r/879889 (owner: 10Cwhite) [01:32:56] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) [01:54:15] (03CR) 10Dzahn: peopleweb: ensure rsync service is stopped on passive host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879878 (https://phabricator.wikimedia.org/T326888) (owner: 10Dzahn) [01:55:43] (03PS4) 10Dzahn: peopleweb: ensure rsync service is stopped on passive host [puppet] - 10https://gerrit.wikimedia.org/r/879878 (https://phabricator.wikimedia.org/T326888) [02:04:20] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:04] (03CR) 10Xcollazo: "This code is now ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/870974 (https://phabricator.wikimedia.org/T323614) (owner: 10Xcollazo) [02:12:47] (JobUnavailable) firing: (7) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:12] (03CR) 10Xcollazo: "Puppet is happy with the changes: https://puppet-compiler.wmflabs.org/output/870974/39165/" [puppet] - 10https://gerrit.wikimedia.org/r/870974 (https://phabricator.wikimedia.org/T323614) (owner: 10Xcollazo) [02:27:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:31:14] PROBLEM - SSH on cp2031 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:34:20] PROBLEM - traffic_server backend process restarted on cp2031 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2031&var-layer=backend [02:35:42] well, clearly cp2031 is unhappy tonight [02:35:52] I am going to depool it again and we can debug tomorrow, otherwise it will page [02:36:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2031.codfw.wmnet,service=cdn [02:36:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2031.codfw.wmnet,service=ats-be [02:37:13] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp2031.codfw.wmnet with reason: downtimed, host unreachable [02:37:18] RECOVERY - SSH on cp2031 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:37:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp2031.codfw.wmnet with reason: downtimed, host unreachable [02:37:47] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:10] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (dbprov1004, ...), Fresh: 121 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [02:51:58] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:52:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:53:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:53:36] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:10:04] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 170 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:11:36] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:36:09] (03PS1) 10KartikMistry: Update cxserver to 2023-01-16-071207-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/881051 (https://phabricator.wikimedia.org/T326236) [04:00:54] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/879889 (owner: 10Cwhite) [04:02:15] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/879888 (owner: 10Cwhite) [04:02:38] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/879887 (owner: 10Cwhite) [04:03:29] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/879886 (owner: 10Cwhite) [04:23:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:42:08] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [04:44:12] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:20:50] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:43:12] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve2007 is OK: OK ferm input default 
policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:37:24] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:24] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:42] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:38:56] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:56] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:39:12] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T0700)
[07:04:02] SRE, Parsoid, Scap, serviceops: scap groups on bastions still
needed? - https://phabricator.wikimedia.org/T327066 (Joe) yeah, +1 to killing with fire :)
[07:34:05] (PS5) KartikMistry: WIP: Enable Content Translation/Section Translation on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/870080 (https://phabricator.wikimedia.org/T325714)
[07:39:44] SRE, ops-codfw, Infrastructure-Foundations, conftool, and 2 others: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (Joe) Open→Resolved This is now fully resolved.
[07:53:35] (CR) Elukey: [V: +2 C: +2] kserve: upgrade to version 0.9 [docker-images/production-images] - https://gerrit.wikimedia.org/r/880499 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[07:56:01] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade
[08:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:07:38] (03PS1) 10Muehlenhoff: bastions: Remove scap:dsh [puppet] - 10https://gerrit.wikimedia.org/r/881348 (https://phabricator.wikimedia.org/T327066) [08:23:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:26:51] (03CR) 10Hashar: gerrit: remove /srv/gerrit/jvmlogs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880963 (owner: 10Hashar) [08:28:05] (03CR) 10Muehlenhoff: [C: 03+2] bastions: Remove scap:dsh [puppet] - 10https://gerrit.wikimedia.org/r/881348 (https://phabricator.wikimedia.org/T327066) (owner: 10Muehlenhoff) [08:30:23] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) [08:32:21] !log mvernon@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw [08:32:33] !log mvernon@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-swift,name=codfw [08:32:45] !log mvernon@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-query,name=codfw [08:33:55] 10SRE, 10Parsoid, 10Scap, 10serviceops: scap groups on bastions still needed? - https://phabricator.wikimedia.org/T327066 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I've removed the Puppet class from the bastions, the existing files will vanish with ongoing reimages. 
[08:34:16] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: name=ms-fe2010.codfw.wmnet [08:34:30] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: name=thanos-fe2002.codfw.wmnet [08:38:04] (03CR) 10Gmodena: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881011 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [08:41:09] (03PS1) 10Jelto: sre.gitlab.upgrade: use gitlab.exceptions in retry [cookbooks] - 10https://gerrit.wikimedia.org/r/881349 (https://phabricator.wikimedia.org/T323569) [08:42:44] (03CR) 10Slyngshede: [C: 03+2] C:apereo_cas Fix regex for IDM [puppet] - 10https://gerrit.wikimedia.org/r/880968 (owner: 10Slyngshede) [08:43:06] (03CR) 10CI reject: [V: 04-1] sre.gitlab.upgrade: use gitlab.exceptions in retry [cookbooks] - 10https://gerrit.wikimedia.org/r/881349 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [08:43:56] (03PS2) 10Jelto: sre.gitlab.upgrade: use gitlab.exceptions in retry [cookbooks] - 10https://gerrit.wikimedia.org/r/881349 (https://phabricator.wikimedia.org/T323569) [08:44:01] (03PS1) 10Hashar: gerrit: remove maintenancce mode parameter [puppet] - 10https://gerrit.wikimedia.org/r/881350 [08:44:45] (03CR) 10Hashar: "I will do that one with John Bond next week." [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [08:51:46] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (10ayounsi) Some notes from {T316532} Make sure console access works. Before the upgrade, remove this configuration stanza, otherwise the `request system software add... 
[08:53:40] (03CR) 10Ayounsi: WIP: add rt_flow grokking (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [08:54:36] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade [08:56:26] (03PS1) 10Hashar: gerrit: remove Gerrit:AuthType Puppet type [puppet] - 10https://gerrit.wikimedia.org/r/881351 [09:00:04] jnuche and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T0900). [09:03:17] (03CR) 10Ayounsi: Add PTR resolution to firewall logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880889 (https://phabricator.wikimedia.org/T327095) (owner: 10Ayounsi) [09:05:44] (03PS1) 10Filippo Giunchedi: site: apply webperf::profiling_tools to arclamp1001 [puppet] - 10https://gerrit.wikimedia.org/r/881352 (https://phabricator.wikimedia.org/T319434) [09:05:46] (03PS1) 10Filippo Giunchedi: Move arclamp to arclamp1001 [puppet] - 10https://gerrit.wikimedia.org/r/881353 (https://phabricator.wikimedia.org/T319434) [09:06:04] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881354 (https://phabricator.wikimedia.org/T325582) [09:06:06] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881354 (https://phabricator.wikimedia.org/T325582) (owner: 10TrainBranchBot) [09:06:42] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881354 (https://phabricator.wikimedia.org/T325582) (owner: 10TrainBranchBot) [09:11:50] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: clean up legacy apifeatureusage class [puppet] - 10https://gerrit.wikimedia.org/r/879886 (owner: 10Cwhite) [09:12:03] (03CR) 10Filippo Giunchedi: [C: 03+1] role, profile: remove logstash(7) role and hiera config 
[puppet] - 10https://gerrit.wikimedia.org/r/879887 (owner: 10Cwhite) [09:13:02] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) @Jclark-ctr could you provide a rough timeline on when we could expect this to happen? Thanks! [09:13:59] (03CR) 10Filippo Giunchedi: [C: 03+1] "As a followup to this I think should assimilate logstash1032 again into kibana (e.g. add it to conftool and pool to serve traffic) ?" [puppet] - 10https://gerrit.wikimedia.org/r/879888 (owner: 10Cwhite) [09:14:08] (03CR) 10Filippo Giunchedi: [C: 03+1] role, profile: remove elasticsearch role and supporting profile [puppet] - 10https://gerrit.wikimedia.org/r/879889 (owner: 10Cwhite) [09:15:54] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.19 refs T325582 [09:15:58] T325582: 1.40.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T325582 [09:17:52] (03CR) 10Filippo Giunchedi: WIP: add rt_flow grokking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [09:19:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/881353 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [09:21:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/881352 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [09:23:55] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (10Marostegui) p:05Triage→03Medium [09:24:12] (03PS1) 10Marostegui: db1198: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/881356 (https://phabricator.wikimedia.org/T327107) [09:24:15] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.19 refs T325582 (duration: 08m 20s) 
[09:24:19] T325582: 1.40.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T325582 [09:24:44] (03CR) 10Marostegui: [C: 03+2] db1198: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/881356 (https://phabricator.wikimedia.org/T327107) (owner: 10Marostegui) [09:32:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host arclamp1001.eqiad.wmnet [09:33:17] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) [09:35:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host arclamp2001.codfw.wmnet [09:37:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/881349 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [09:38:42] 10SRE, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10ayounsi) Seeing what happened with codfw row B, it's safe to assume that only a reboot of the faulty switch member wil... [09:39:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp1001.eqiad.wmnet [09:41:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp2001.codfw.wmnet [09:45:46] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/880897 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [09:47:29] (03PS1) 10Hashar: gerrit: split user and application directories [puppet] - 10https://gerrit.wikimedia.org/r/881359 (https://phabricator.wikimedia.org/T323262) [09:48:18] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: use gitlab.exceptions in retry (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/881349 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [09:48:28] (03PS2) 10Hashar: gerrit: remove maintenance mode parameter [puppet] - 10https://gerrit.wikimedia.org/r/881350 [09:48:30] (03PS2) 10Hashar: gerrit: remove Gerrit:AuthType Puppet type [puppet] - 10https://gerrit.wikimedia.org/r/881351 [09:48:32] (03PS2) 10Hashar: gerrit: split user and application directories [puppet] - 10https://gerrit.wikimedia.org/r/881359 (https://phabricator.wikimedia.org/T323262) [09:48:46] (03CR) 10Elukey: [C: 03+2] kserve: upgrade to upstream version 0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/880897 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [09:49:37] !log start migration from webperf1004 to arclamp1001 - T319434 [09:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:41] T319434: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 [09:50:01] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: use gitlab.exceptions in retry [cookbooks] - 10https://gerrit.wikimedia.org/r/881349 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [09:51:18] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:51:39] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[09:55:41] (03CR) 10Filippo Giunchedi: [C: 03+2] site: apply webperf::profiling_tools to arclamp1001 [puppet] - 10https://gerrit.wikimedia.org/r/881352 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [09:57:53] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.timer,arclamp_generate_metrics.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:44] that's me ^ [10:00:47] (03PS1) 10Jcrespo: dbbackups: Reorganize backups with the new dbprov[12]04 host [puppet] - 10https://gerrit.wikimedia.org/r/881360 (https://phabricator.wikimedia.org/T327155) [10:03:00] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/880895 (https://phabricator.wikimedia.org/T324439) (owner: 10Clément Goubert) [10:04:17] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:06:23] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 8 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:06:51] (03CR) 10Btullis: [C: 03+2] Add analytics-platform-eng-admins and system user keytab to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/879648 (https://phabricator.wikimedia.org/T326827) (owner: 10Ottomata) [10:07:18] (03PS1) 10Zabe: Start reading from cuc_comment_id from a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881361 (https://phabricator.wikimedia.org/T233004) [10:08:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879878 
(https://phabricator.wikimedia.org/T326888) (owner: 10Dzahn) [10:11:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881361 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [10:11:42] (03CR) 10Btullis: [C: 03+1] "Looks good to me and I've deployed the other change upon which this depends." [puppet] - 10https://gerrit.wikimedia.org/r/870974 (https://phabricator.wikimedia.org/T323614) (owner: 10Xcollazo) [10:11:52] (03Merged) 10jenkins-bot: Start reading from cuc_comment_id from a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881361 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [10:12:23] !log zabe@deploy1002 Started scap: Backport for [[gerrit:881361|Start reading from cuc_comment_id from a few wikis (T233004)]] [10:12:28] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [10:12:37] (03CR) 10Filippo Giunchedi: [C: 03+2] Move arclamp to arclamp1001 [puppet] - 10https://gerrit.wikimedia.org/r/881353 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [10:12:59] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade [10:14:12] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:881361|Start reading from cuc_comment_id from a few wikis (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [10:21:41] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:881361|Start reading from cuc_comment_id from a few wikis (T233004)]] (duration: 09m 17s) [10:21:44] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [10:23:56] (03CR) 10Filippo Giunchedi: "LGTM overall, see also inline" [puppet] - 10https://gerrit.wikimedia.org/r/880895 (https://phabricator.wikimedia.org/T324439) (owner: 10Clément Goubert) [10:26:12] (03PS3) 10Clément Goubert: 
logstash: Fix typo in mediawiki.httpd.accesslog [puppet] - 10https://gerrit.wikimedia.org/r/880895 (https://phabricator.wikimedia.org/T324439) [10:27:20] 10SRE, 10SRE-OnFire, 10ops-codfw, 10Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10MatthewVernon) @Eevans do you still need to re-enable Cassandra hints in codfw row b? [10:27:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/880895 (https://phabricator.wikimedia.org/T324439) (owner: 10Clément Goubert) [10:27:37] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39166/console" [puppet] - 10https://gerrit.wikimedia.org/r/880895 (https://phabricator.wikimedia.org/T324439) (owner: 10Clément Goubert) [10:28:15] (03CR) 10Clément Goubert: [V: 03+1] logstash: Fix typo in mediawiki.httpd.accesslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880895 (https://phabricator.wikimedia.org/T324439) (owner: 10Clément Goubert) [10:28:48] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] logstash: Fix typo in mediawiki.httpd.accesslog [puppet] - 10https://gerrit.wikimedia.org/r/880895 (https://phabricator.wikimedia.org/T324439) (owner: 10Clément Goubert) [10:36:01] (03PS3) 10Hashar: gerrit: split user and application directories [puppet] - 10https://gerrit.wikimedia.org/r/881359 (https://phabricator.wikimedia.org/T323262) [10:37:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:38:23] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:38:39] PROBLEM - 
OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fab28454280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [10:38:39] org/wiki/Search%23Administration [10:39:27] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:57] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:40:03] PROBLEM - Check systemd state on logstash1024 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:05] PROBLEM - OpenSearch health check for shards on 9200 on logstash1024 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fb495c3d280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [10:40:05] org/wiki/Search%23Administration [10:40:13] yes yes, the logstash stuff is known [10:43:13] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 15, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 660, active_shards: 1489, 
relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [10:43:13] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:43:28] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1050.eqiad.wmnet with OS bullseye [10:44:09] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:55] (LogstashNoLogsIndexed) firing: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed [10:45:55] (03CR) 10Hashar: "That is a starting point to relocate Gerrit application directory from the Unix user directory hierarchy at /var/lib/gerrit2 toward the a" [puppet] - 10https://gerrit.wikimedia.org/r/881359 (https://phabricator.wikimedia.org/T323262) (owner: 10Hashar) [10:46:04] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/881359 (https://phabricator.wikimedia.org/T323262) (owner: 10Hashar) [10:47:13] 10SRE, 10Infrastructure-Foundations: Define the core attribute list managed in the IDM with all stakeholders - https://phabricator.wikimedia.org/T320805 (10SLyngshede-WMF) Ideally I'd like to reuse the "owner" field from groups as a field to identify who can approve access requests. We need to check that the f... 
[10:47:49] RECOVERY - Check systemd state on logstash1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:53] RECOVERY - OpenSearch health check for shards on 9200 on logstash1024 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 660, active_shards: 1489, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [10:47:53] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:48:12] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) [10:48:44] (03PS1) 10Marostegui: db1176: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/881363 [10:49:19] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade [10:49:55] (LogstashNoLogsIndexed) resolved: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed [10:50:47] (03CR) 10Marostegui: [C: 03+2] db1176: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/881363 (owner: 10Marostegui) [10:51:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1176 to LB with just 1% weight T326116', diff saved to https://phabricator.wikimedia.org/P43184 and previous config saved to /var/cache/conftool/dbconfig/20230118-105106-marostegui.json [10:51:11] T326116: Package and test MariaDB 11 - https://phabricator.wikimedia.org/T326116 [10:54:52] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
mc1050.eqiad.wmnet with reason: host reimage [10:57:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1050.eqiad.wmnet with reason: host reimage [10:59:19] !log volans@cumin1001 START - Cookbook sre.network.cf [10:59:20] !log volans@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T1100) [11:00:19] (03PS2) 10Hnowlan: thumbor: add and use haproxy healthz lvs check [puppet] - 10https://gerrit.wikimedia.org/r/880898 (https://phabricator.wikimedia.org/T233196) [11:03:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [11:07:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1176 T326116', diff saved to https://phabricator.wikimedia.org/P43185 and previous config saved to /var/cache/conftool/dbconfig/20230118-110716-marostegui.json [11:07:18] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [11:07:27] T326116: Package and test MariaDB 11 - https://phabricator.wikimedia.org/T326116 [11:07:48] around [11:10:13] !log volans@cumin1001 START - Cookbook sre.network.cf [11:10:15] !log volans@cumin1001 END (FAIL) - Cookbook sre.network.cf (exit_code=1) [11:10:29] !log volans@cumin1001 START - Cookbook sre.network.cf [11:10:31] !log volans@cumin1001 END (FAIL) - Cookbook sre.network.cf (exit_code=1) [11:11:07] !log volans@cumin2002 START - Cookbook sre.network.cf [11:11:08] !log volans@cumin2002 END (FAIL) - Cookbook sre.network.cf (exit_code=1) 
[11:12:19] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1050.eqiad.wmnet with OS bullseye [11:15:32] !log volans@cumin1001 START - Cookbook sre.network.cf [11:15:32] !log volans@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [11:16:15] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:31] !log volans@cumin1001 START - Cookbook sre.network.cf [11:16:31] !log volans@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [11:17:45] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:46] (03CR) 10Hokwelum: [C: 03+1] "Thank you for addressing that @Andrew!" [puppet] - 10https://gerrit.wikimedia.org/r/879274 (owner: 10Majavah) [11:19:25] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:29] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:22:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [11:22:43] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) [11:27:09] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) [11:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE 
certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:31:05] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:40:16] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (10JEbe-WMF) [11:40:33] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (10JEbe-WMF) [11:42:25] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade [11:43:00] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (10JEbe-WMF) [11:43:21] PROBLEM - Check systemd state on an-presto1011 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:41] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (10JEbe-WMF) Ticket grant Jennifer Ebe LDAP access for Onboarding. 
[11:43:43] PROBLEM - Check systemd state on an-presto1014 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:51] PROBLEM - Check systemd state on an-presto1010 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:55] PROBLEM - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:05] PROBLEM - Check systemd state on an-presto1008 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:15] PROBLEM - Check systemd state on an-presto1012 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:15] PROBLEM - Check systemd state on an-presto1009 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:27] PROBLEM - Check systemd state on an-presto1013 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:02] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on 10 hosts with reason: Still not ready to add these new presto servers to the cluster - btullis [11:47:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on 10 hosts with reason: Still not ready to add these new presto servers to the cluster - btullis [11:54:21] !log upgraded cumin on cumin1001 to 4.2.0-1+deb11u1 [11:54:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:25] (03PS1) 10Marostegui: Revert "db1176: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/881043 [12:06:52] (03CR) 10Marostegui: [C: 03+2] Revert "db1176: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/881043 (owner: 10Marostegui) [12:07:05] (03PS1) 10Slyngshede: D:apereo_cas::service Allow OIDC to define release policies [puppet] - 10https://gerrit.wikimedia.org/r/881365 [12:07:28] (03CR) 10CI reject: [V: 04-1] D:apereo_cas::service Allow OIDC to define release policies [puppet] - 10https://gerrit.wikimedia.org/r/881365 (owner: 10Slyngshede) [12:08:46] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for rpcbind [puppet] - 10https://gerrit.wikimedia.org/r/881386 (https://phabricator.wikimedia.org/T135991) [12:09:26] (03PS2) 10Slyngshede: D:apereo_cas::service Allow OIDC to define release policies [puppet] - 10https://gerrit.wikimedia.org/r/881365 [12:09:46] (03CR) 10CI reject: [V: 04-1] D:apereo_cas::service Allow OIDC to define release policies [puppet] - 10https://gerrit.wikimedia.org/r/881365 (owner: 10Slyngshede) [12:15:50] (03PS1) 10Dreamy Jazz: Pin CheckUserEventTablesMigrationStage to read and write old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881390 (https://phabricator.wikimedia.org/T324907) [12:16:31] (03PS3) 10Slyngshede: D:apereo_cas::service Allow OIDC to define release policies [puppet] - 10https://gerrit.wikimedia.org/r/881365 [12:17:47] (03PS1) 10Muehlenhoff: clouddumps: Enable profile::auto_restarts::service for rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/881391 (https://phabricator.wikimedia.org/T135991) [12:19:27] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39167/console" [puppet] - 10https://gerrit.wikimedia.org/r/881365 (owner: 10Slyngshede) [12:20:08] 10SRE-swift-storage: Rclone is fussy about missing objects - 
https://phabricator.wikimedia.org/T327269 (10MatthewVernon) [12:20:42] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) [12:22:01] 10SRE-swift-storage: Rclone is fussy about missing objects - https://phabricator.wikimedia.org/T327269 (10MatthewVernon) I note in passing that we didn't pick this up in testing because in dry-run mode `rclone` (not unreasonably) tells you what objects it would try and copy but doesn't actually try to read the s... [12:23:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:24:56] (03PS4) 10Slyngshede: D:apereo_cas::service Allow OIDC to define release policies [puppet] - 10https://gerrit.wikimedia.org/r/881365 [12:25:20] (03PS1) 10MSantos: mobileapps: bump to 2023-01-18-011911-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/881392 [12:25:57] (03PS1) 10WMDE-Fisch: Revert "Breaking upgrade: mapdata" [extensions/Kartographer] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881045 (https://phabricator.wikimedia.org/T327151) [12:30:43] Hi, we couldn't deploy mobileapps yesterday because of the codfw network outage/broken CI. Is it OK if we deploy to production now outside of regular deployment window? 
[12:34:58] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for nfs-idmapd [puppet] - 10https://gerrit.wikimedia.org/r/881393 (https://phabricator.wikimedia.org/T135991) [12:35:29] (03PS5) 10Slyngshede: D:apereo_cas::service Allow OIDC to define release policies [puppet] - 10https://gerrit.wikimedia.org/r/881365 [12:37:37] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39169/console" [puppet] - 10https://gerrit.wikimedia.org/r/881365 (owner: 10Slyngshede) [12:41:33] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10kostajh) I updated https://wikitech.wikimedia.org/wiki/Template:BastionMap and created pages for the new bastions, but as discussed on ops@lists.wikimedia.org, the ssh fing... [12:42:03] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2023-01-18-011911-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/881392 (owner: 10MSantos) [12:47:18] (03Merged) 10jenkins-bot: mobileapps: bump to 2023-01-18-011911-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/881392 (owner: 10MSantos) [12:49:20] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for blkmapd [puppet] - 10https://gerrit.wikimedia.org/r/881399 (https://phabricator.wikimedia.org/T135991) [12:51:14] (03CR) 10Ladsgroup: "For my patch that I sent, I didn't have the override in IS.php, this has a wrong override and might not work, let me test it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [13:02:43] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Revert "Breaking upgrade: mapdata" [extensions/Kartographer] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881045 (https://phabricator.wikimedia.org/T327151) (owner: 10WMDE-Fisch) [13:06:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change 
compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [13:09:41] this was actually due to a spike yesterday I think ^ [13:13:24] 10SRE, 10observability, 10Patch-For-Review, 10User-fgiunchedi: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10fgiunchedi) [13:16:47] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on webperf1004.eqiad.wmnet with reason: decom [13:17:11] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on webperf1004.eqiad.wmnet with reason: decom [13:21:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [13:22:40] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:22:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:24:00] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:24:07] (03CR) 10Ladsgroup: "Nope, works as expected 😊 I'd put it slightly lower but meh, it's temporary." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [13:24:20] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49421 bytes in 2.418 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:36:06] this just got mentioned in the Wikimedia Discord, but it seems that on all group0 and group1 wikis displaying the desktop version of legacy vector on a phone got borked; https://phabricator.wikimedia.org/T327256 [13:45:19] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10MoritzMuehlenhoff) >>! In T324974#8534698, @kostajh wrote: > I updated https://wikitech.wikimedia.org/wiki/Template:BastionMap and created pages for the new bastions, but a... [13:53:16] (03PS2) 10Ottomata: flink 1.16.0-wmf3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881011 (https://phabricator.wikimedia.org/T316519) [13:53:23] (03PS3) 10Ottomata: flink 1.16.0-wmf3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881011 (https://phabricator.wikimedia.org/T316519) [13:53:28] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink 1.16.0-wmf3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881011 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [13:59:53] (03CR) 10Atieno: [C: 03+1] Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) (owner: 10Vlad.shapik) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T1400). 
[14:00:05] Dreamy_Jazz, WMDE-Fisch, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:35] #o [14:00:38] hi [14:00:42] \o [14:01:22] (03CR) 10Ladsgroup: [10%] English Wikipedia uses Vector 2022 skin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [14:01:29] Would be nice, if someone could deploy my backport. I'm currently in a sick child busy situation at home ^^' [14:01:37] I can deploy in a few minutes [14:02:02] (03CR) 10Ladsgroup: [10%] English Wikipedia uses Vector 2022 skin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [14:02:10] (03CR) 10Jgiannelos: [C: 03+1] Enable Linter write namespace tag and template using core config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [14:02:17] (03PS3) 10Jgiannelos: Enable Linter write namespace tag and template using core config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [14:05:32] Hi there [14:07:02] I'm around for the backport window and can test. 
[14:09:48] (03PS1) 10Phedenskog: prometheus: recording rules for CPU benchmark without labels [puppet] - 10https://gerrit.wikimedia.org/r/881411 (https://phabricator.wikimedia.org/T321398) [14:10:59] o/ [14:11:01] alright, now I can deploy [14:11:42] let’s +2 WMDE-Fisch and then start with Dreamy_Jazz [14:11:54] \o/ [14:12:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "+2 to kick off gate-and-submit" [extensions/Kartographer] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881045 (https://phabricator.wikimedia.org/T327151) (owner: 10WMDE-Fisch) [14:13:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879946 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [14:13:14] (03PS4) 10Lucas Werkmeister (WMDE): Write to cul_reason[_plaintext]_id everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879946 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [14:13:25] (03CR) 10TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879946 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [14:14:26] (03Merged) 10jenkins-bot: Write to cul_reason[_plaintext]_id everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879946 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [14:14:50] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:879946|Write to cul_reason[_plaintext]_id everywhere (T233004)]] [14:14:54] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:15:13] (03CR) 10Phedenskog: "I'm not sure about the naming starting with the labels. 
What I'm thinking is that when we start to add all our metrics and will have many " [puppet] - 10https://gerrit.wikimedia.org/r/881411 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog) [14:16:35] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and dreamyjazz: Backport for [[gerrit:879946|Write to cul_reason[_plaintext]_id everywhere (T233004)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:16:49] Dreamy_Jazz: the change should be on mwdebug [14:16:56] can it be tested? [14:17:23] maybe you can do a checkuser action and I can peek at the corresponding database row? (assuming you don’t have database access yourself) [14:21:42] Yeah. It should be testable. Didn't hear the ping. [14:21:45] Will test now [14:21:48] ok, thanks [14:22:46] (looks like the Kartographer backport might finish gate-and-submit too soon, but that’s not a problem) [14:23:24] !log installing mod-wsgi security updates [14:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:44] I've made a test check [14:24:14] should I look at the database anywhere or is the test finished? [14:24:15] If you look for checks on "Dreamy Jazz Bot" [14:24:32] which wiki? or is it the central db? [14:24:36] enwiki [14:26:16] (03Merged) 10jenkins-bot: Revert "Breaking upgrade: mapdata" [extensions/Kartographer] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881045 (https://phabricator.wikimedia.org/T327151) (owner: 10WMDE-Fisch) [14:26:18] cu_log should have an entry for a check on "Dreamy Jazz Bot" (my latest check) which should have the two columns filled with comment IDs. 
Those comment IDs should be for "[test] testing config change made in [[gerrit:879946]]" (cul_reason_id) and "[test] testing config change made in gerrit:879946" (cul_plaintext_reason_id) [14:26:25] found the row [14:26:44] I see the two reason IDs [14:26:48] let me check the comment table [14:27:05] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for nfs.mountd [puppet] - 10https://gerrit.wikimedia.org/r/881413 (https://phabricator.wikimedia.org/T135991) [14:27:22] indeed, LGTM! [14:27:23] thanks [14:27:27] continuing sync [14:27:31] Thanks for helping testing. [14:28:44] I also notice cul_user(_text) are empty, I assume those already got replaced by cul_actor :) [14:30:51] Yes. Those I think were recently stopped writing to [14:31:29] it'd be nice if you could +2 my backports ahead of time, since the deployments recently tend to take forever [14:31:36] brb in 5 minutes [14:32:05] MatmaRex: I’ll do that once I start the Kartographer backport [14:34:23] Yup. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/880903 meant that cul_user(_text) are no longer written to [14:34:36] nice [14:34:44] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:879946|Write to cul_reason[_plaintext]_id everywhere (T233004)]] (duration: 19m 54s) [14:34:48] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:35:27] Thanks! 
[14:35:30] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:881045|Revert "Breaking upgrade: mapdata" (T327151)]] [14:35:33] T327151: Defective Kartographer maps in Wikivoyge: empty maps and wrong group names - https://phabricator.wikimedia.org/T327151 [14:35:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert gallery changes in 1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880920 (https://phabricator.wikimedia.org/T326990) (owner: 10Bartosz Dziewoński) [14:37:17] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and wmde-fisch: Backport for [[gerrit:881045|Revert "Breaking upgrade: mapdata" (T327151)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [14:37:34] Testing.... [14:37:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:51] https://en.wikivoyage.org/wiki/Edfu looks good to me after a Ctrl+F5, at least [14:38:03] (I could see the “Group: 0” before) [14:38:35] Jepp [14:38:43] And also the expandable map is fine again. [14:38:52] okay, deploying then [14:38:53] thanks! [14:39:14] Yes! Thank you! 
[14:42:00] (03PS6) 10Jdrewniak: [10%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [14:44:23] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert gallery changes in 1.40.0-wmf.18 & .19 [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880921 (https://phabricator.wikimedia.org/T326990) (owner: 10Bartosz Dziewoński) [14:46:03] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:881045|Revert "Breaking upgrade: mapdata" (T327151)]] (duration: 10m 33s) [14:46:07] T327151: Defective Kartographer maps in Wikivoyge: empty maps and wrong group names - https://phabricator.wikimedia.org/T327151 [14:46:27] (03PS3) 10Lucas Werkmeister (WMDE): Enable the REST API on test-wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [14:48:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880920 (https://phabricator.wikimedia.org/T326990) (owner: 10Bartosz Dziewoński) [14:48:07] (03PS7) 10Jdrewniak: [10%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [14:51:07] (03Merged) 10jenkins-bot: Revert gallery changes in 1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880920 (https://phabricator.wikimedia.org/T326990) (owner: 10Bartosz Dziewoński) [14:51:26] (03CR) 10Jdlrobson: [C: 04-1] "per slack let's add instruemntation to 0.05 in this patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/881021 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [14:51:33] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:880920|Revert gallery changes in 1.40.0-wmf.18 (T326990)]] [14:51:37] T326990: gallery tag with mode=slideshow and caption=value does not display full image - https://phabricator.wikimedia.org/T326990 [14:51:59] 10SRE-swift-storage: Rclone is fussy about missing objects - https://phabricator.wikimedia.org/T327269 (10MatthewVernon) p:05Triage→03High [14:52:13] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) p:05Triage→03Medium [14:53:21] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and matmarex: Backport for [[gerrit:880920|Revert gallery changes in 1.40.0-wmf.18 (T326990)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:54:20] MatmaRex: can you test the wmf.18 version of the backport? 
[14:54:27] looking [14:54:30] ok [14:54:47] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [14:55:22] thcipriani: I think the scap for the last backport (currently in gate-and-submit, zuul predicts ETA 4 min) might run into the next window a bit :/ [14:55:25] hopefully not too much [14:56:16] (I also added another config change to the window but there won’t be time for that, I’ll do that later instead) [14:56:48] Lucas_WMDE: looks good [14:56:55] ok, thanks [14:56:55] syncing [14:57:02] !log uploaded python-jose 3.3.0+dfsg-4~wmf11u1 to apt.wikimedia.org (needed by python-social-auth/Bitu) [14:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:16] (03PS1) 10Andrew Bogott: Horizon: replace puppetized maintenance mode flag with file check [puppet] - 10https://gerrit.wikimedia.org/r/881417 (https://phabricator.wikimedia.org/T327190) [14:57:41] (sorry for the delay, i was a little confused by my test page. it has some weird layout unrelated to the bug https://en.wikipedia.org/wiki/117th_United_States_Congress#Party_summary ) [14:58:05] MatmaRex: do you think the wmf.19 version should be tested separately or should I just sync that right away? 
[14:58:21] <_joe_> !issync [14:58:22] Syncing #wikimedia-operations (requested by joe_oblivian) [14:58:23] Set /cs flags #wikimedia-operations claime +Aiotv [14:58:25] Set /cs flags #wikimedia-operations kavitha +Aiotv [14:58:54] (03CR) 10Xcollazo: Add a systemd timer to clean up old data related to image_suggestions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870974 (https://phabricator.wikimedia.org/T323614) (owner: 10Xcollazo) [14:59:17] Lucas_WMDE: i suppose you can sync, and i can test afterwards [14:59:23] ok [14:59:28] sukhe: regarding yesterday's 'labweb-ssl:7443 has failed probes' alert (T327190): I've been having Horizon return 503 during maintenance, which seems like the right code for 'down for maintenance'. Should I return something different? Or should we not alert on 503? [14:59:29] T327190: Improve horizon downtime process - https://phabricator.wikimedia.org/T327190 [14:59:32] I don’t want to take up too much of the Vector 2022 deployment window [15:00:04] I almost switched it to 200 just to keep the probe happy but that feels wrong [15:00:05] thcipriani: It is that lovely time of the day again! You are hereby commanded to deploy Vector 2022 deployment. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T1500). [15:00:22] o/ [15:00:24] thcipriani: I’m still syncing, sorry [15:00:28] andrewbogott: my understanding is that we should not alert for it if it is under maintenance [15:00:43] here's a test page for wmf.19 https://www.mediawiki.org/wiki/Reading/Web/Desktop_Improvements#What_features_will_be_added [15:00:46] has it been downtimed since? 
AFAIK we didn't get any more alerts [15:00:54] Lucas_WMDE: no worries, ping me or jan_drewniak when you've finished up [15:00:57] the last backport is almost done and I’d prefer to sync it as well (but if you want, I can remove the +2s again and postpone) [15:00:59] ok, thanks [15:01:01] will do [15:01:01] no, it was down for 90 minutes and then I finished :) [15:01:02] !log cp2031: rebooting to gather more information (still downtimed + depooled) [15:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:04] (03Merged) 10jenkins-bot: Revert gallery changes in 1.40.0-wmf.18 & .19 [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880921 (https://phabricator.wikimedia.org/T326990) (owner: 10Bartosz Dziewoński) [15:02:18] here's the apache conf if that helps: https://gerrit.wikimedia.org/r/c/operations/puppet/+/881417/1/modules/openstack/templates/zed/horizon/labtesthorizon.wikimedia.org.erb [15:03:09] I can take labtesthorizon in and out of maintenance if that helps, although that already has alerting suppressed. [15:03:15] andrewbogott: I think (personally and what I do), downtiming is better, just in case there are other alerts too [15:03:17] (03PS1) 10Elukey: kserve: fix missing comma in template [deployment-charts] - 10https://gerrit.wikimedia.org/r/881418 (https://phabricator.wikimedia.org/T325528) [15:03:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:04:19] sukhe: agreed, was thinking there might also be a bug in our alerting though. 
[15:04:38] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:880920|Revert gallery changes in 1.40.0-wmf.18 (T326990)]] (duration: 13m 04s) [15:04:41] T326990: gallery tag with mode=slideshow and caption=value does not display full image - https://phabricator.wikimedia.org/T326990 [15:05:04] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "nice comma!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/881418 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [15:05:05] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:880921|Revert gallery changes in 1.40.0-wmf.18 & .19 (T326990)]] [15:05:58] RECOVERY - traffic_server backend process restarted on cp2031 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2031&var-layer=backend [15:06:51] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and matmarex: Backport for [[gerrit:880921|Revert gallery changes in 1.40.0-wmf.18 & .19 (T326990)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [15:07:17] MatmaRex: thanks, I was able to quickly check the change and it looks good to me [15:07:18] syncing [15:07:26] and looks good to me as well [15:07:28] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1052.eqiad.wmnet with OS bullseye [15:07:32] thanks [15:11:41] it’s restarting php-fpm now [15:13:00] !log cp2031: rebooting to gather more information (still downtimed + depooled) [15:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:17] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:880921|Revert gallery changes in 1.40.0-wmf.18 & .19 (T326990)]] (duration: 09m 11s) [15:14:18] thcipriani, jan_drewniak: I’m done syncing now, sorry again for the delay 
[15:14:20] T326990: gallery tag with mode=slideshow and caption=value does not display full image - https://phabricator.wikimedia.org/T326990 [15:14:37] thanks Lucas_WMDE :) [15:14:53] (03PS1) 10Volans: role::wmcs::cloudlb: set contacts hiera [puppet] - 10https://gerrit.wikimedia.org/r/881421 [15:15:49] (03PS1) 10Muehlenhoff: slapd: Add support to configure MDB storage backend (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) [15:16:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [15:17:22] (03Merged) 10jenkins-bot: [10%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [15:17:41] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/881421 (owner: 10Volans) [15:17:47] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:879659|[10%] English Wikipedia uses Vector 2022 skin (T326892)]] [15:17:51] T326892: Make Vector 2022 the default skin on English Wikipedia - https://phabricator.wikimedia.org/T326892 [15:18:05] (03CR) 10Volans: [C: 03+2] role::wmcs::cloudlb: set contacts hiera [puppet] - 10https://gerrit.wikimedia.org/r/881421 (owner: 10Volans) [15:18:43] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1052.eqiad.wmnet with reason: host reimage [15:19:36] !log jdrewniak@deploy1002 jdrewniak and jdlrobson: Backport for [[gerrit:879659|[10%] English Wikipedia uses Vector 2022 skin (T326892)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [15:21:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1052.eqiad.wmnet with reason: host reimage [15:27:05] 
(03PS2) 10Jdlrobson: [25%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881020 (https://phabricator.wikimedia.org/T326892) [15:27:12] (03PS2) 10Jdlrobson: [50%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881021 (https://phabricator.wikimedia.org/T326892) [15:29:17] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:879659|[10%] English Wikipedia uses Vector 2022 skin (T326892)]] (duration: 11m 30s) [15:29:20] T326892: Make Vector 2022 the default skin on English Wikipedia - https://phabricator.wikimedia.org/T326892 [15:30:34] (03PS3) 10Jdlrobson: [50%] English Wikipedia uses Vector 2022 skin, adds instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881021 (https://phabricator.wikimedia.org/T326892) [15:31:46] !log re-enabling Cassandra hinted-handoff for codfw -- T327001 [15:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:50] T327001: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 [15:32:14] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: replace puppetized maintenance mode flag with file check [puppet] - 10https://gerrit.wikimedia.org/r/881417 (https://phabricator.wikimedia.org/T327190) (owner: 10Andrew Bogott) [15:33:18] (03CR) 10Elukey: [C: 03+2] kserve: fix missing comma in template [deployment-charts] - 10https://gerrit.wikimedia.org/r/881418 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [15:33:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881020 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [15:34:36] 10SRE, 10SRE-OnFire, 10ops-codfw, 10Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10Eevans) >>! 
In T327001#8534084, @MatthewVernon wrote: > @Eevans do you still need to re-enable Cassandra hints in codfw row b? {{done}} [15:34:44] (03Merged) 10jenkins-bot: [25%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881020 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [15:35:10] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:881020|[25%] English Wikipedia uses Vector 2022 skin (T326892)]] [15:35:13] T326892: Make Vector 2022 the default skin on English Wikipedia - https://phabricator.wikimedia.org/T326892 [15:35:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10RobH) [15:35:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10RobH) [15:36:06] hm, the vector window is three hours long now? [15:36:58] !log jdrewniak@deploy1002 jdrewniak and jdlrobson: Backport for [[gerrit:881020|[25%] English Wikipedia uses Vector 2022 skin (T326892)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [15:37:06] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:37:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:37:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1052.eqiad.wmnet with OS bullseye [15:38:58] Lucas_WMDE: Yes. [15:39:10] Lucas_WMDE: Biggest deploy of the year, etc. etc. 
[15:39:29] it does seem more realistic than the previous duration ^^ [15:39:38] but I’ll have to see where I can fit my config change now [15:42:44] Lucas_WMDE: I padded it out to give folks time to test [15:43:07] unsure if it'll take that whole time [15:43:51] (03PS1) 10Ottomata: flink-app-example Set egress enabled in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/881424 (https://phabricator.wikimedia.org/T324576) [15:44:16] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:881020|[25%] English Wikipedia uses Vector 2022 skin (T326892)]] (duration: 09m 06s) [15:44:20] T326892: Make Vector 2022 the default skin on English Wikipedia - https://phabricator.wikimedia.org/T326892 [15:48:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881021 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [15:49:19] (03Merged) 10jenkins-bot: [50%] English Wikipedia uses Vector 2022 skin, adds instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881021 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [15:49:41] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:881021|[50%] English Wikipedia uses Vector 2022 skin, adds instrumentation (T326892)]] [15:49:44] T326892: Make Vector 2022 the default skin on English Wikipedia - https://phabricator.wikimedia.org/T326892 [15:50:18] thcipriani: if you could let me know if you finish early, that’d be great (but otherwise it’s also okay) [15:50:23] best of luck with the deployment! 
[15:51:27] !log jdrewniak@deploy1002 jdrewniak and jdlrobson: Backport for [[gerrit:881021|[50%] English Wikipedia uses Vector 2022 skin, adds instrumentation (T326892)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [15:51:49] (03CR) 10Btullis: [C: 03+2] Add a systemd timer to clean up old data related to image_suggestions [puppet] - 10https://gerrit.wikimedia.org/r/870974 (https://phabricator.wikimedia.org/T323614) (owner: 10Xcollazo) [15:52:13] 7 [15:52:20] err :) [15:53:49] (03CR) 10Ottomata: [C: 03+2] flink-app-example Set egress enabled in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/881424 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:54:10] 10SRE-swift-storage: Rclone is fussy about missing objects - https://phabricator.wikimedia.org/T327269 (10Eevans) > Revert to swiftrepl (which doesn't care) until T327253 is fixed Apologies if this should be obvious, but what are `swiftrepl`'s semantics here? I assume it's meant to propagate deletions from one... [15:58:34] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:881021|[50%] English Wikipedia uses Vector 2022 skin, adds instrumentation (T326892)]] (duration: 08m 52s) [15:58:38] T326892: Make Vector 2022 the default skin on English Wikipedia - https://phabricator.wikimedia.org/T326892 [15:59:49] (03Merged) 10jenkins-bot: flink-app-example Set egress enabled in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/881424 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:59:55] Lucas_WMDE: will do :) [16:01:10] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:01:40] sukhe: ^ [16:01:45] should we worry? 
[16:01:52] yeah not sure, looking [16:01:54] that's cr3-eqsin [16:02:20] oh and it says that in the alert :D I just saw the IP address and missed the hostname [16:02:37] I am not sure it's realted [16:02:39] *related [16:02:45] XioNoX: ^ around? [16:02:46] shall we continue sukhe ? [16:02:46] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:02:50] ok recovery [16:02:58] flapping [16:03:02] I don't think it was related but good to check [16:03:04] XioNoX: bblack: ^ [16:03:10] sorry, double-ping for XioNo.X [16:03:13] muscle [16:03:21] Amir1: for now assume everything's ok, IMHO [16:03:27] until we learn otherwise :) [16:03:32] ok [16:03:46] (03PS1) 10Cwhite: logstash: extend alert logs retention to 5y [puppet] - 10https://gerrit.wikimedia.org/r/881368 (https://phabricator.wikimedia.org/T304924) [16:04:11] it was a loss of SNMP responses for a short time, but we have no evidence of "real" network-layer problems (yet) [16:04:12] yeah looks like it missed a check [16:04:26] yeah, was wondering if the SNMP miss is evidence of something else :P [16:04:38] we'd probably see stronger other evidence by now [16:06:20] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply [16:06:21] 10SRE-swift-storage: Rclone is fussy about missing objects - https://phabricator.wikimedia.org/T327269 (10MatthewVernon) Yeah, `swifrepl` propagates deletions, and seems not to care about missing objects; reverting to using it instead of `rclone` temporarily is my option 3. 
[16:06:25] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply [16:09:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881022 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [16:09:34] (03PS2) 10Jdrewniak: [75%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881022 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [16:10:03] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881022 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [16:10:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/881368 (https://phabricator.wikimedia.org/T304924) (owner: 10Cwhite) [16:10:51] (03Merged) 10jenkins-bot: [75%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881022 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [16:11:06] 10SRE-swift-storage: Rclone is fussy about missing objects - https://phabricator.wikimedia.org/T327269 (10Eevans) >>! In T327269#8535481, @MatthewVernon wrote: > Yeah, `swifrepl` propagates deletions, and seems not to care about missing objects; reverting to using it instead of `rclone` temporarily is my option... 
[16:11:15] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:881022|[75%] English Wikipedia uses Vector 2022 skin (T326892)]] [16:11:19] T326892: Make Vector 2022 the default skin on English Wikipedia - https://phabricator.wikimedia.org/T326892 [16:13:03] !log jdrewniak@deploy1002 jdrewniak and jdlrobson: Backport for [[gerrit:881022|[75%] English Wikipedia uses Vector 2022 skin (T326892)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [16:14:19] (03CR) 10Cwhite: [C: 03+2] logstash: extend alert logs retention to 5y [puppet] - 10https://gerrit.wikimedia.org/r/881368 (https://phabricator.wikimedia.org/T304924) (owner: 10Cwhite) [16:16:50] 10SRE-tools, 10Infrastructure-Foundations: Cookbook for rack downtime - https://phabricator.wikimedia.org/T327300 (10ayounsi) [16:19:13] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: recording rules for CPU benchmark without labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881411 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog) [16:19:21] 10SRE-swift-storage: Rclone is fussy about missing objects - https://phabricator.wikimedia.org/T327269 (10MatthewVernon) OIC. 
Yes, I think so [with a caveat that I don't know how `swiftrepl` deals with other failures] [16:20:39] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:881022|[75%] English Wikipedia uses Vector 2022 skin (T326892)]] (duration: 09m 24s) [16:20:44] T326892: Make Vector 2022 the default skin on English Wikipedia - https://phabricator.wikimedia.org/T326892 [16:23:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:23:13] (03PS1) 10Filippo Giunchedi: webperf: update rsync source/dest with the new reality [puppet] - 10https://gerrit.wikimedia.org/r/881449 (https://phabricator.wikimedia.org/T319434) [16:26:32] (03PS2) 10Jdrewniak: [100%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881023 (owner: 10Jdlrobson) [16:28:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881023 (owner: 10Jdlrobson) [16:29:34] (03Merged) 10jenkins-bot: [100%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881023 (owner: 10Jdlrobson) [16:29:56] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:881023|[100%] English Wikipedia uses Vector 2022 skin]] [16:31:40] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/881449 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [16:31:42] !log jdrewniak@deploy1002 jdrewniak and jdlrobson: Backport for [[gerrit:881023|[100%] English Wikipedia uses Vector 2022 skin]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, 
mwdebug1002.eqiad.wmnet [16:39:24] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:881023|[100%] English Wikipedia uses Vector 2022 skin]] (duration: 09m 27s) [16:40:01] 100% \o/ [16:40:05] ^ [16:41:55] (I’m still holding off for now, in case you want to test or roll back) [16:41:56] but, exciting! [16:45:40] !log jnuche@deploy1002 Installing scap version "4.33.0" for 1 hosts [16:45:50] !log jnuche@deploy1002 Installation of scap version "4.33.0" completed for 1 hosts [16:50:01] (03PS2) 10AOkoth: vrts: add vrts2001 hieradata and database port [puppet] - 10https://gerrit.wikimedia.org/r/880488 (https://phabricator.wikimedia.org/T323515) [16:50:23] (03CR) 10AOkoth: vrts: add vrts2001 hieradata and database port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880488 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [16:52:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10BTullis) Hi @Papaul - apologies for the delay in getting back to you. Please could we use the `partman/raid10-8dev.cfg` recipe? I... [16:52:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10BTullis) [16:55:02] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10Papaul) 05Open→03Resolved A checked this it is showing 5.77A which is good. We can resolve this task. 
Phase, AA:L1-L2, Current 5.77 A [16:57:44] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1036'] [16:57:53] (03CR) 10Filippo Giunchedi: [C: 03+2] webperf: update rsync source/dest with the new reality [puppet] - 10https://gerrit.wikimedia.org/r/881449 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [16:59:57] (03PS1) 10Jdlrobson: Bump English Wikipedia event logging from 0.5 to 1% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881451 (https://phabricator.wikimedia.org/T326892) [17:00:21] (03CR) 10Herron: [C: 03+1] role, profile: remove elasticsearch role and supporting profile [puppet] - 10https://gerrit.wikimedia.org/r/879889 (owner: 10Cwhite) [17:00:44] (03CR) 10Herron: [C: 03+1] role: remove kibana7_ecs role [puppet] - 10https://gerrit.wikimedia.org/r/879888 (owner: 10Cwhite) [17:01:14] (03CR) 10Herron: [C: 03+1] role, profile: remove logstash(7) role and hiera config [puppet] - 10https://gerrit.wikimedia.org/r/879887 (owner: 10Cwhite) [17:01:33] (03CR) 10Herron: [C: 03+1] profile: clean up legacy apifeatureusage class [puppet] - 10https://gerrit.wikimedia.org/r/879886 (owner: 10Cwhite) [17:05:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware for hosts ['logstash1036'] [17:07:44] (03PS1) 10Filippo Giunchedi: site: apply webperf::profiling_tools to arclamp2001 [puppet] - 10https://gerrit.wikimedia.org/r/881452 (https://phabricator.wikimedia.org/T319429) [17:07:46] (03PS1) 10Filippo Giunchedi: Move arclamp to arclamp2001 [puppet] - 10https://gerrit.wikimedia.org/r/881453 (https://phabricator.wikimedia.org/T319429) [17:09:38] (03PS1) 10Cwhite: logstash: move blackbox-exporter logs to ecs-promblkboxexp indexes [puppet] - 10https://gerrit.wikimedia.org/r/881370 (https://phabricator.wikimedia.org/T327308) [17:09:39] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts 
['logstash1037'] [17:10:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1037'] [17:10:46] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1037'] [17:12:08] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [17:12:41] (03CR) 10Cwhite: "The partition name "promblkboxexp" is meh. Tried to keep the name short. Open to alternatives." [puppet] - 10https://gerrit.wikimedia.org/r/881370 (https://phabricator.wikimedia.org/T327308) (owner: 10Cwhite) [17:13:44] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [17:19:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware for hosts ['logstash1037'] [17:20:56] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:22:31] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:23:34] (03PS1) 10Jdlrobson: Legacy Vector is not a responsive skin [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881431 (https://phabricator.wikimedia.org/T327256) [17:23:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10Papaul) [17:25:25] (03PS1) 10Ottomata: flink-app 0.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881455 (https://phabricator.wikimedia.org/T324576) [17:35:36] !log jnuche@deploy1002 Installing scap version "4.33.0" for 561 hosts [17:36:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Papaul) @RobH this is okay now after @Jclark-ctr reseted the NIC [17:40:02] (03CR) 10Ottomata: [C: 03+2] flink-app 0.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881455 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:41:54] !log btullis@deploy1002 Installing scap version "4.33.0" for 1 hosts [17:42:04] !log btullis@deploy1002 Installation of scap version "4.33.0" completed for 1 hosts [17:42:49] !log jnuche@deploy1002 install-world aborted: (duration: 07m 17s) [17:44:36] 
!log jnuche@deploy1002 Installing scap version "4.33.0" for 560 hosts [17:44:39] (03Merged) 10jenkins-bot: flink-app 0.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881455 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:44:59] !log jnuche@deploy1002 Installation of scap version "4.33.0" completed for 560 hosts [17:51:26] jouncebot: now [17:51:27] For the next 0 hour(s) and 8 minute(s): Vector 2022 deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T1500) [17:53:52] jouncebot: next [17:53:53] In 0 hour(s) and 6 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T1800) [17:54:28] nah, that’s not enough time for a config change I think. I’ll do it later [17:54:55] unless the people responsible for that window (the calendar doesn’t have individual names) tell me they don’t need it and I can go ahead ;) [17:55:20] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply [17:55:41] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply [17:58:32] (03Abandoned) 10AOkoth: Revert "Revert "vrts: add vrts2001 values and add database port in config"" [puppet] - 10https://gerrit.wikimedia.org/r/869717 (owner: 10AOkoth) [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T1800) [18:09:56] Lucas_WMDE: we have nothing to do in that window, go ahead [18:10:06] claime: ok, thanks a lot! 
[18:10:42] (03PS4) 10Lucas Werkmeister (WMDE): Enable the REST API on test-wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [18:11:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [18:11:52] (03Merged) 10jenkins-bot: Enable the REST API on test-wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [18:12:16] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:878927|Enable the REST API on test-wikidata (T324999)]] [18:12:20] T324999: configure Wikibase REST API on Wikidata - https://phabricator.wikimedia.org/T324999 [18:14:07] !log lucaswerkmeister-wmde@deploy1002 migr and lucaswerkmeister-wmde: Backport for [[gerrit:878927|Enable the REST API on test-wikidata (T324999)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [18:14:38] https://test.wikidata.org/w/rest.php/wikibase/v0/entities/items/Q213573 works on mwdebug, I’ll continue syncing [18:21:54] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:878927|Enable the REST API on test-wikidata (T324999)]] (duration: 09m 38s) [18:21:58] T324999: configure Wikibase REST API on Wikidata - https://phabricator.wikimedia.org/T324999 [18:22:18] and now it also works without mwdebug [18:22:23] I think I’m done :) [18:23:39] (03PS1) 10Bking: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) [18:27:15] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team: Spicerack: Add CI step to test with wmcs cookbooks - 
https://phabricator.wikimedia.org/T325758 (10fnegri) [18:27:41] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team: Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) [18:27:49] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team: Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (10fnegri) [18:28:36] 10SRE-tools, 10Cloud-Services, 10Infrastructure-Foundations, 10cloud-services-team, 10Patch-For-Review: Cumin/Openstack: multi-project commands are extremely slow - https://phabricator.wikimedia.org/T325773 (10fnegri) [18:30:22] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team: Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10fnegri) [18:34:26] 10SRE, 10Cloud-Services, 10cloud-services-team, 10observability, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10fnegri) [18:35:56] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10cloud-services-team: clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10fnegri) [18:37:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:40:40] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q3): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [18:41:34] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10fnegri) [18:45:17] 10Puppet, 10Cloud-VPS, 
10Infrastructure-Foundations, 10cloud-services-team: Remove prod-specific bits from cloud puppetmasters - https://phabricator.wikimedia.org/T309281 (10fnegri) [18:46:34] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10fnegri) [18:48:45] 10SRE, 10cloud-services-team: Fix all .erb variable warnings - https://phabricator.wikimedia.org/T97251 (10fnegri) [18:51:17] 10SRE, 10DNS, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox: Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621 (10fnegri) [18:51:31] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Join ARIN waiting list to request additional IPv4 resources. - https://phabricator.wikimedia.org/T288342 (10fnegri) [18:52:14] 10SRE, 10Cloud-Services, 10Traffic-Icebox, 10cloud-services-team: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10fnegri) [18:54:04] 10Puppet, 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10fnegri) [18:54:50] 10SRE, 10Privacy Engineering, 10Traffic-Icebox, 10Performance-Team (Radar), 10Privacy: Disable WMF-Last-Access cookies for wmfusercontent.org - https://phabricator.wikimedia.org/T210167 (10BCornwall) 05Open→03Resolved a:03BCornwall Hi, @Krinkle! Thanks to work in T262996 phab.wmfuserdata.org no lon... 
[18:58:54] (03CR) 10Dzahn: [C: 03+2] peopleweb: ensure rsync service is stopped on passive host [puppet] - 10https://gerrit.wikimedia.org/r/879878 (https://phabricator.wikimedia.org/T326888) (owner: 10Dzahn) [18:59:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1036.eqiad.wmnet with OS buster [19:00:04] jnuche and jeena: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T1900). [19:00:05] jnuche and jeena: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T1900). Please do the needful. [19:00:55] (03CR) 10Dzahn: "looks good to me! can you compile it one more time, this time on both hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/880488 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [19:02:17] (03CR) 10Dzahn: [C: 03+1] vrts: add vrts2001 hieradata and database port [puppet] - 10https://gerrit.wikimedia.org/r/880488 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [19:02:19] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10User-jbond: Normalise hiera default values - https://phabricator.wikimedia.org/T289665 (10fnegri) [19:02:35] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Add more rspec test to the puppet code - https://phabricator.wikimedia.org/T289668 (10fnegri) [19:02:55] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10User-jbond: Audit puppet usage in cloud hosts - https://phabricator.wikimedia.org/T289658 (10fnegri) [19:03:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:06:26] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10fnegri) [19:10:14] 10SRE, 10Wikimedia-Mailing-lists, 10cloud-services-team: auto-subscribe cloud-vps and/or toolforge users to cloud-announce - https://phabricator.wikimedia.org/T278361 (10fnegri) [19:11:15] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: wmcs.spicerack: Setup a host to run cookbooks from prod network - https://phabricator.wikimedia.org/T276440 (10fnegri) [19:12:54] 10SRE, 10cloud-services-team, 10serviceops: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10fnegri) [19:14:13] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10cloud-services-team: Puppet class systemd needs to throw a more useful error - https://phabricator.wikimedia.org/T195553 (10fnegri) [19:15:08] 10Puppet, 10Infrastructure-Foundations, 10cloud-services-team: ops/puppet: generalize systemd resource control for users - https://phabricator.wikimedia.org/T215401 (10fnegri) [19:17:37] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team, 10IPv6: Some WMCS clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271139 (10fnegri) [19:19:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1037.eqiad.wmnet with OS buster [19:20:31] (03PS1) 10Eigyan: [config]: Undeploy GDI Safety Survey Wave 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881462 (https://phabricator.wikimedia.org/T327296) [19:20:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10Papaul) [19:23:21] 10SRE, 
10cloud-services-team, 10Security, 10User-MoritzMuehlenhoff: Implement SSH CA (certificate authority) for host keys? - https://phabricator.wikimedia.org/T268344 (10fnegri) [19:23:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1036.eqiad.wmnet with reason: host reimage [19:26:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1036.eqiad.wmnet with reason: host reimage [19:27:29] (03CR) 10EllenR: [C: 03+1] "looks like it should work" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881462 (https://phabricator.wikimedia.org/T327296) (owner: 10Eigyan) [19:28:54] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: ceph: test and decide 1 network interface setup - https://phabricator.wikimedia.org/T325531 (10fnegri) [19:29:52] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, 10Patch-For-Review: Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10fnegri) [19:30:11] 10SRE, 10Observability-Metrics, 10Traffic-Icebox: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (10BCornwall) A simpler solution may be to use systemd's [[ https://www.freedesktop.org/software/systemd/man/systemd.unit.html#PartOf= | PartOf= ]] option:... 
[19:30:55] (03CR) 10AOkoth: [C: 03+2] vrts: add vrts2001 hieradata and database port [puppet] - 10https://gerrit.wikimedia.org/r/880488 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [19:31:14] 10SRE, 10Platform Engineering, 10cloud-services-team: Get platform engineering team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273738 (10fnegri) [19:32:07] 10SRE, 10Cloud-VPS, 10Traffic-Icebox, 10cloud-services-team: Get traffic team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273737 (10fnegri) [19:32:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1037.eqiad.wmnet with reason: host reimage [19:33:07] 10SRE, 10DNS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation - https://phabricator.wikimedia.org/T266331 (10fnegri) [19:34:13] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: CloudVPS: IPv6 early PoC - https://phabricator.wikimedia.org/T245495 (10fnegri) [19:34:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host logstash1037.eqiad.wmnet with OS buster [19:35:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1037.eqiad.wmnet with reason: host reimage [19:36:04] 10SRE, 10DC-Ops, 10cloud-services-team: Supporting new hardware in older debian releases - https://phabricator.wikimedia.org/T301162 (10fnegri) [19:36:10] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team: spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (10fnegri) [19:36:53] 10Puppet, 10Infrastructure-Foundations, 10cloud-services-team: Reduce the effects of 
puppet breakage on VPS - https://phabricator.wikimedia.org/T226270 (10fnegri) [19:37:00] 10Puppet, 10Infrastructure-Foundations, 10cloud-services-team, 10User-jbond: Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10fnegri) [19:38:02] 10Puppet, 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, and 2 others: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10fnegri) [19:38:12] (03CR) 10Herron: [C: 03+1] logstash: move blackbox-exporter logs to ecs-promblkboxexp indexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881370 (https://phabricator.wikimedia.org/T327308) (owner: 10Cwhite) [19:40:24] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:42:14] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10fnegri) [19:44:44] 10SRE, 10Infrastructure-Foundations, 10netops, 10IPv6, and 2 others: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10BBlack) Bump - these issues continue to affect us sometimes. There seem to be some cases where Juniper can mis-route an RA to an... 
[19:45:55] (03PS1) 10AOkoth: site: add vrts role to vrts2001 [puppet] - 10https://gerrit.wikimedia.org/r/881465 (https://phabricator.wikimedia.org/T323515) [19:47:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:47:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1036.eqiad.wmnet with OS buster [19:47:56] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10fnegri) [19:50:01] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/881465/39171/vrts2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/881465 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [19:51:53] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:52:20] !log db1129 and lvs1017: removed misconfigured IP address in wrong vlan from eno1 and /e/n/i [19:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:13] 10Puppet, 10Cloud-Services, 10Infrastructure-Foundations, 10cloud-services-team: Consider ways to make puppetmaster CA changes smoother on the puppet client end - https://phabricator.wikimedia.org/T220268 (10fnegri) [19:54:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:54:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1037.eqiad.wmnet with OS buster [19:57:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - 
https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logstash1036.eqiad.wmnet with OS buster completed: - logst... [19:58:35] (03CR) 10Dzahn: [C: 03+1] "The only risk I saw here was that the new host could write to the DB, but we can see in compiler output that it gets " "vrts_databa" [puppet] - 10https://gerrit.wikimedia.org/r/881465 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [19:58:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10IPv6, and 2 others: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10BBlack) I fixed all these cases noted above for now. Note that in the lvs1017 case, this could've potentially caused a public ser... [19:58:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logstash1037.eqiad.wmnet with OS buster completed: - logst... [20:00:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10Papaul) [20:02:36] (03CR) 10Dzahn: "eh, did you expect this to the change the error page content? https://puppet-compiler.wmflabs.org/output/881359/39170/gerrit1001.wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/881359 (https://phabricator.wikimedia.org/T323262) (owner: 10Hashar) [20:03:06] (03CR) 10Dzahn: "I see. it comes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/881350/2 .. 
let me go there first then" [puppet] - 10https://gerrit.wikimedia.org/r/881359 (https://phabricator.wikimedia.org/T323262) (owner: 10Hashar) [20:04:07] (03CR) 10AOkoth: [C: 03+2] site: add vrts role to vrts2001 [puppet] - 10https://gerrit.wikimedia.org/r/881465 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [20:09:46] (03CR) 10Dzahn: [C: 03+2] "I like how you made the IRC links actual links. https://puppet-compiler.wmflabs.org/output/881350/39172/" [puppet] - 10https://gerrit.wikimedia.org/r/881350 (owner: 10Hashar) [20:11:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10Papaul) 05Open→03Resolved @herron @colewhite this is complete [20:12:54] (03CR) 10Dzahn: [C: 03+2] "yea, confirmed it does not appear to be used anywhere" [puppet] - 10https://gerrit.wikimedia.org/r/881351 (owner: 10Hashar) [20:16:15] (03CR) 10Dzahn: [C: 03+2] "lgtm. https://puppet-compiler.wmflabs.org/output/881359/39173/" [puppet] - 10https://gerrit.wikimedia.org/r/881359 (https://phabricator.wikimedia.org/T323262) (owner: 10Hashar) [20:18:33] (03CR) 10Dzahn: [C: 03+2] "deployed first on gerrit2002, confirmed noop, deployed on gerrit1001, confirmed noop" [puppet] - 10https://gerrit.wikimedia.org/r/881359 (https://phabricator.wikimedia.org/T323262) (owner: 10Hashar) [20:19:27] (03CR) 10Dzahn: [C: 03+2] admin: bash config for Antoine [puppet] - 10https://gerrit.wikimedia.org/r/877093 (owner: 10Hashar) [20:23:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:24:58] 10SRE, 10serviceops-collab, 10Patch-For-Review: rsync server on people2002 - 
https://phabricator.wikimedia.org/T326888 (10Dzahn) 05Open→03Resolved [20:27:26] (03PS1) 10Zabe: Use core's PoolCounterClient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881466 (https://phabricator.wikimedia.org/T327336) [20:27:28] (03PS1) 10Zabe: Stop loading PoolCounter extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881467 (https://phabricator.wikimedia.org/T327336) [20:27:30] (03PS1) 10Zabe: Remove PoolCounter from extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881468 (https://phabricator.wikimedia.org/T327336) [20:29:07] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for KSarabia-WMF - https://phabricator.wikimedia.org/T327337 (10KSarabia-WMF) [20:29:20] PROBLEM - Check systemd state on vrts2001 is CRITICAL: CRITICAL - degraded: The following units failed: vrts-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:30] arnoldokoth: ^ adding the role also meant getting monitoring. but it's not ready yet. so let's add some downtimes with the downtime cookbook. it can be 14 days or whatever [20:31:08] Oof. On it. 
[20:31:33] np, you can run it from cumin hosts and will affect all services on the host [20:34:18] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [20:34:31] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [20:35:12] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1036.eqiad.wmnet with OS bullseye [20:36:03] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1037.eqiad.wmnet with OS bullseye [20:42:14] RECOVERY - Check systemd state on vrts2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:44:53] (03CR) 10Cwhite: logstash: move blackbox-exporter logs to ecs-promblkboxexp indexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881370 (https://phabricator.wikimedia.org/T327308) (owner: 10Cwhite) [20:45:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:48:12] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1036.eqiad.wmnet with reason: host reimage [20:49:05] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1037.eqiad.wmnet with reason: host reimage [20:49:45] (03PS4) 10Dzahn: deployment_server: Migrate tools/release to gitlab [puppet] - 10https://gerrit.wikimedia.org/r/879908 (https://phabricator.wikimedia.org/T290260) (owner: 10Jeena Huneidi) [20:50:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:51:02] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1036.eqiad.wmnet with reason: host reimage [20:53:37] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/881370 (https://phabricator.wikimedia.org/T327308) (owner: 10Cwhite) [20:53:44] (03CR) 10Ottomata: "Were there any changes to templates/crds? if so they should be copied to flink-kubernetes-operator-crds/templates. If not, we good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [20:54:07] RECOVERY - Host ps1-b2-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.57 ms [20:54:07] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1037.eqiad.wmnet with reason: host reimage [20:54:35] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/879908/39174/" [puppet] - 10https://gerrit.wikimedia.org/r/879908 (https://phabricator.wikimedia.org/T290260) (owner: 10Jeena Huneidi) [20:57:29] (03CR) 10Dzahn: [C: 03+2] "as expected this is a noop on deployment* and releases* since the git::clone is only "ensure present". So you should check the next time y" [puppet] - 10https://gerrit.wikimedia.org/r/879908 (https://phabricator.wikimedia.org/T290260) (owner: 10Jeena Huneidi) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230118T2100). [21:00:05] Jdlrobson, Lucas_WMDE, and eigyan: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:33] I can deploy. [21:01:04] Greetings All 0/ [21:01:11] present [21:02:23] (03PS2) 10Cwhite: logstash: clean up curator actions todo items [puppet] - 10https://gerrit.wikimedia.org/r/869251 (https://phabricator.wikimedia.org/T301760) [21:02:54] Jdlrobson: what user-facing change will https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/881431/ have? [21:02:56] (03PS2) 10Cwhite: logstash: change ecs-default clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869252 [21:03:20] (03PS2) 10Cwhite: logstash: change ecs-test clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869253 [21:03:40] !start UTC late backport window [21:03:43] (03PS2) 10Cwhite: logstash: change w3creportingapi clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869254 [21:03:45] !log start UTC late backport window [21:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:04] kindrobot: it's a regression on the current train (details in phab ticket). Not sure I understand the question? [21:05:13] We don't want it to go out to group 2 wikis [21:05:28] I see. OK. [21:05:29] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1036.eqiad.wmnet with OS bullseye [21:06:00] Is it safe to deploy your two together Jdlrobson? [21:06:16] (03CR) 10Cwhite: [C: 03+2] profile: clean up legacy apifeatureusage class [puppet] - 10https://gerrit.wikimedia.org/r/879886 (owner: 10Cwhite) [21:07:29] kindrobot: yep [21:07:49] Great. I'll start them now. 
[21:08:31] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1037.eqiad.wmnet with OS bullseye [21:08:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881451 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [21:08:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881431 (https://phabricator.wikimedia.org/T327256) (owner: 10Jdlrobson) [21:09:08] (03PS2) 10Stef Dunlap: Bump English Wikipedia event logging from 0.5 to 1% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881451 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [21:09:44] (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881451 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [21:09:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881431 (https://phabricator.wikimedia.org/T327256) (owner: 10Jdlrobson) [21:10:30] (03Merged) 10jenkins-bot: Bump English Wikipedia event logging from 0.5 to 1% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881451 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [21:10:53] (03PS1) 10Cwhite: site: remove insetup::observability role from logstash203[67] [puppet] - 10https://gerrit.wikimedia.org/r/881371 (https://phabricator.wikimedia.org/T321335) [21:14:13] (03PS1) 10Bking: cloudelastic: bump smaller cluster heap from 10 to 12G [puppet] - 10https://gerrit.wikimedia.org/r/881474 (https://phabricator.wikimedia.org/T323646) [21:16:24] (03PS1) 10Cwhite: site: assign logging::opensearch::data role to logstash103[67] [puppet] - 
https://gerrit.wikimedia.org/r/881372 (https://phabricator.wikimedia.org/T327338) [21:16:27] (PS1) Cwhite: logstash: add logstash103[67] to cluster config [puppet] - https://gerrit.wikimedia.org/r/881373 (https://phabricator.wikimedia.org/T327338) [21:18:42] (CR) Cwhite: [C: +2] site: remove insetup::observability role from logstash203[67] [puppet] - https://gerrit.wikimedia.org/r/881371 (https://phabricator.wikimedia.org/T321335) (owner: Cwhite) [21:20:03] (PS2) Cwhite: logstash: add logstash103[67] to cluster config [puppet] - https://gerrit.wikimedia.org/r/881373 (https://phabricator.wikimedia.org/T327338) [21:21:02] (CR) Ebernhardson: cloudelastic: bump smaller cluster heap from 10 to 12G (1 comment) [puppet] - https://gerrit.wikimedia.org/r/881474 (https://phabricator.wikimedia.org/T323646) (owner: Bking) [21:22:17] (PS2) Cwhite: site: assign logging::opensearch::data role to logstash103[67] [puppet] - https://gerrit.wikimedia.org/r/881372 (https://phabricator.wikimedia.org/T327338) [21:22:23] hello [21:22:55] Hey eigyan, we're still in the middle of Jdlrobson's backport. 
[21:22:57] (Merged) jenkins-bot: Legacy Vector is not a responsive skin [skins/Vector] (wmf/1.40.0-wmf.19) - https://gerrit.wikimedia.org/r/881431 (https://phabricator.wikimedia.org/T327256) (owner: Jdlrobson) [21:23:18] ok my connection dropped :0 [21:23:25] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:881451|Bump English Wikipedia event logging from 0.5 to 1% (T326892)]], [[gerrit:881431|Legacy Vector is not a responsive skin (T327256)]] [21:23:30] T327256: Desktop mode problem on mobile with the legacy Vector - https://phabricator.wikimedia.org/T327256 [21:23:31] T326892: Make Vector 2022 the default skin on English Wikipedia - https://phabricator.wikimedia.org/T326892 [21:23:52] (PS2) Bking: cloudelastic: bump smaller cluster heap from 10 to 12G [puppet] - https://gerrit.wikimedia.org/r/881474 (https://phabricator.wikimedia.org/T323646) [21:24:51] No worries. [21:24:51] (CR) Bking: "oops, forgot to hit save. New patchset incoming..." [puppet] - https://gerrit.wikimedia.org/r/881474 (https://phabricator.wikimedia.org/T323646) (owner: Bking) [21:25:33] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/881474 (https://phabricator.wikimedia.org/T323646) (owner: Bking) [21:25:49] !log kindrobot@deploy1002 kindrobot and jdlrobson: Backport for [[gerrit:881451|Bump English Wikipedia event logging from 0.5 to 1% (T326892)]], [[gerrit:881431|Legacy Vector is not a responsive skin (T327256)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:26:07] Jdlrobson: can you confirm? [21:26:31] (CR) Ebernhardson: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/881474 (https://phabricator.wikimedia.org/T323646) (owner: Bking) [21:27:03] kindrobot: oh it's ready? Looking! [21:28:27] kindrobot: LGTM [21:28:41] Great, both patches? 
[21:29:08] kindrobot: yep [21:29:13] (CR) Bking: [C: +2] cloudelastic: bump smaller cluster heap from 10 to 12G [puppet] - https://gerrit.wikimedia.org/r/881474 (https://phabricator.wikimedia.org/T323646) (owner: Bking) [21:29:18] Thank you, syncing. :) [21:30:39] eigyan: we'll do yours next after the sync as I don't see Lucas_WMDE in the channel. [21:31:24] perfect thank you kindrobot [21:36:01] thank you kindrobot ! [21:36:27] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:881451|Bump English Wikipedia event logging from 0.5 to 1% (T326892)]], [[gerrit:881431|Legacy Vector is not a responsive skin (T327256)]] (duration: 13m 01s) [21:36:32] T327256: Desktop mode problem on mobile with the legacy Vector - https://phabricator.wikimedia.org/T327256 [21:36:33] T326892: Make Vector 2022 the default skin on English Wikipedia - https://phabricator.wikimedia.org/T326892 [21:36:43] No problem, thanks Jdlrobson :) [21:37:13] (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/881462 (https://phabricator.wikimedia.org/T327296) (owner: Eigyan) [21:37:41] (PS2) Stef Dunlap: [config]: Undeploy GDI Safety Survey Wave 4 [mediawiki-config] - https://gerrit.wikimedia.org/r/881462 (https://phabricator.wikimedia.org/T327296) (owner: Eigyan) [21:38:18] (CR) TrainBranchBot: "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/881462 (https://phabricator.wikimedia.org/T327296) (owner: Eigyan) [21:38:57] (Merged) jenkins-bot: [config]: Undeploy GDI Safety Survey Wave 4 [mediawiki-config] - https://gerrit.wikimedia.org/r/881462 (https://phabricator.wikimedia.org/T327296) (owner: Eigyan) [21:39:20] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:881462|[config]: Undeploy GDI Safety Survey Wave 4 (T327296)]] [21:39:24] T327296: Undeploy GDI Safety Survey Wave 4 on EN, ES, FR, FA, PT 
wikis - https://phabricator.wikimedia.org/T327296 [21:41:10] !log kindrobot@deploy1002 essexigyan and kindrobot: Backport for [[gerrit:881462|[config]: Undeploy GDI Safety Survey Wave 4 (T327296)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:41:16] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (Jclark-ctr) [21:41:19] eigyan: can you confirm? [21:41:22] will do kindrobot [21:42:48] kindrobot all is 💯 [21:43:24] :D syncing [21:43:37] thank you kindrobot [21:44:51] Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, User-dcaro: Investigate use of Puppet "environments" for per-project Puppet manifests - https://phabricator.wikimedia.org/T170370 (fnegri) [21:45:09] Last call for Lucas Werkmeister (Lucas_WMDE) or anyone who will speak to "[config] 878927 (deploy commands) Enable the REST API on test-wikidata" [21:45:35] that patch is already merged and deployed [21:45:52] Heh, I feel silly now. Thanks zabe. 
[21:46:32] yw :) [21:50:05] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:881462|[config]: Undeploy GDI Safety Survey Wave 4 (T327296)]] (duration: 10m 45s) [21:50:09] T327296: Undeploy GDI Safety Survey Wave 4 on EN, ES, FR, FA, PT wikis - https://phabricator.wikimedia.org/T327296 [21:50:23] SRE-tools, Infrastructure-Foundations, cloud-services-team: Cumin: create external backend for WMCS Puppet API - https://phabricator.wikimedia.org/T179816 (fnegri) [21:50:27] !log close UTC late backport window [21:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:33] thanks everyone :) [21:54:21] SRE, Data-Services, Infrastructure-Foundations, cloud-services-team, User-MoritzMuehlenhoff: Switch labstore servers to default SSH configuration - https://phabricator.wikimedia.org/T177914 (fnegri) [22:03:32] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: raise heap memory to 12G - bking@cumin1001 - T323646 [22:03:36] T323646: Observe results from JVM options/heap memory changes - https://phabricator.wikimedia.org/T323646 [22:16:15] SRE, Sustainability (Incident Followup): get a legend for haproxy "anomalous session termination states" - https://phabricator.wikimedia.org/T308952 (RLazarus) Open→Resolved a:RLazarus (Back from vacation, sorry for the delay.) Yeah, I think we can close this. Thanks! 
[22:25:55] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:27:31] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:35:23] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: raise heap memory to 12G - bking@cumin1001 - T323646 [22:35:26] T323646: Observe results from JVM options/heap memory changes - https://phabricator.wikimedia.org/T323646 [22:37:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:44:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:48:06] I just got a 503 [22:48:21] *server error 503 [22:51:30] SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (Platonides) From the small portion, it would seem that the files were uploaded and when they were lateret deleted at MediaWiki, the files would be copied into the archive, and rmed from swi... 
[22:54:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:58:01] (CirrusSearchHighOldGCFrequency) resolved: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:03:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:42:14] (PS2) Dreamy Jazz: Pin CheckUserEventTablesMigrationStage to read and write old [mediawiki-config] - https://gerrit.wikimedia.org/r/881390 (https://phabricator.wikimedia.org/T324907) [23:47:49] !log run populateCulComment.php on all group0 and group1 wikis # T327290 [23:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:53] T327290: Run PopulateCulComment on all wikis - https://phabricator.wikimedia.org/T327290