[00:00:39] PROBLEM - Host ping3001 is DOWN: PING CRITICAL - Packet loss = 100% [00:01:10] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10Bstorm) >>! In T285539#7178969, @faidon wrote: > **In other words, 2021'... [00:02:11] RECOVERY - Host ping3001 is UP: PING OK - Packet loss = 0%, RTA = 107.26 ms [00:02:47] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [00:02:47] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [00:03:18] Be there in a minute [00:05:38] o/ [00:07:14] hi [00:07:47] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [00:07:47] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [00:08:01] welp [00:08:07] I guess we're good? [00:08:26] also I thought it was only supposed to page us if the librenms alert lasted 5 minutes [00:10:46] cwhite: ^ ? [00:10:47] alert started 23:56:16 - page ~00:02 [00:10:52] ah, ok [00:11:00] ended 00:06 [00:11:20] (03CR) 10Krinkle: "It'll probably need a tweak to docroot/noc/db.php at least. Possibly a few other places." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701995 (owner: 10Tim Starling) [00:11:26] graphs don't look too out of sorts though [00:12:30] the one thing I was a bit worried about is that this was for eqdfw, and we just shifted a bunch of traffic to codfw [00:13:48] (03CR) 10Krinkle: [C: 03+1] "logging loads before db-*, but this isn't currently obvious or required I think. 
A comment in the header of db-* files and/or in COmmonSet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701995 (owner: 10Tim Starling) [00:13:59] PROBLEM - Host lvs3005 is DOWN: PING CRITICAL - Packet loss = 100% [00:14:19] RECOVERY - Host lvs3005 is UP: PING OK - Packet loss = 0%, RTA = 106.89 ms [00:23:11] PROBLEM - Host ping3001 is DOWN: PING CRITICAL - Packet loss = 100% [00:23:45] RECOVERY - Host ping3001 is UP: PING OK - Packet loss = 0%, RTA = 109.32 ms [00:25:56] !log krinkle@mwmaint1002: purgeParserCache.php --tag pc1, ref T282761 [00:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:06] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [00:49:44] (03PS3) 10Tim Starling: Include SQL queries in the debug log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701995 [00:50:21] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 2863 MB (10% inode=94%): /tmp 2863 MB (10% inode=94%): /var/tmp 2862 MB (10% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [00:51:34] (03CR) 10Tim Starling: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701995 (owner: 10Tim Starling) [01:11:15] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [01:45:05] (03CR) 10Krinkle: [C: 03+1] Include SQL queries in the debug log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701995 (owner: 10Tim Starling) [01:50:17] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.75`. 
Pre-deploy tests passing on canary `wdqs1003` [01:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:56:07] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@0e916b1]: 0.3.75 [01:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:43] !log [WDQS Deploy] Tests passing following deploy of `0.3.75` on canary `wdqs1003`; proceeding to rest of fleet [01:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:05] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210629T0200) [02:04:48] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@0e916b1]: 0.3.75 (duration: 08m 40s) [02:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:57] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:07:47] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:08:05] here [02:08:25] Hi [02:08:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.37.0-wmf.12 [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702002 [02:08:33] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.37.0-wmf.12 [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702002 (owner: 10TrainBranchBot) [02:08:37] Looks like the same alert as earlier [02:08:50] yeah [02:08:57] it's the peering link in eqdfw [02:09:19] very likely to be aggressive scraping from some commercial cloud or from big tech [02:09:35] o/ [02:10:14] https://w.wiki/3ZHZ [02:10:18] AS8075 Microsoft Corp [02:11:07] against upload-lb.codfw.wikimedia.org [02:11:27] not the first time we've had very aggressive scraping against upload-lb recently heh [02:13:52] looks like a bunch of random images [02:15:06] empty user-agent [02:15:45] I put some IPs in _security [02:17:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:17:47] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:22:50] (03PS1) 10CDanis: upload: empty User-Agent also violates policy [puppet] - 10https://gerrit.wikimedia.org/r/702003 [02:24:15] (03CR) 10Legoktm: [C: 03+1] upload: empty User-Agent also violates policy [puppet] - 10https://gerrit.wikimedia.org/r/702003 (owner: 10CDanis) [02:25:07] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:25:55] (03CR) 10Cwhite: [C: 03+1] upload: empty User-Agent also violates policy [puppet] - 10https://gerrit.wikimedia.org/r/702003 (owner: 10CDanis) [02:26:05] (03CR) 10CDanis: [C: 03+2] upload: empty User-Agent also violates policy [puppet] - 10https://gerrit.wikimedia.org/r/702003 (owner: 10CDanis) [02:27:50] (03Merged) 10jenkins-bot: Branch commit for wmf/1.37.0-wmf.12 [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702002 (owner: 10TrainBranchBot) [02:27:53] !log ✔️ 
cdanis@cumin2001.codfw.wmnet ~ 🕥🍺 sudo cumin -b16 'A:cp-upload' 'run-puppet-agent -q' [02:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:25] 10SRE, 10Traffic, 10User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (10Legoktm) We responded to another set of pages today and most of the offending requests were coming from a public Cloud with no User-agent, so we've banned th... [02:34:31] !log T285643 Banned `elastic1039` from all 3 elasticsearch clusters and set `elastic1039.eqiad.wmnet` to failed in netbox [02:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:38] T285643: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T285643 [02:39:14] I don't suppose anyone here can help clean up a small mess that happened regarding the branch cut? was just asking on #releng also... [02:42:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:42:47] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:43:48] (03PS1) 10CDanis: upload: empty UA and - UA also violate policy [puppet] - 10https://gerrit.wikimedia.org/r/702026 [02:45:20] (03CR) 10Cwhite: [C: 03+1] upload: empty UA and - UA also violate policy [puppet] - 10https://gerrit.wikimedia.org/r/702026 (owner: 10CDanis) [02:45:32] (03CR) 10Legoktm: [C: 03+1] "This seems correct based on https://varnish-cache.org/docs/6.0/reference/vcl.html#booleans" [puppet] - 10https://gerrit.wikimedia.org/r/702026 (owner: 10CDanis) [02:45:51] (03CR) 10CDanis: [C: 03+2] upload: empty UA and - UA also violate policy [puppet] - 10https://gerrit.wikimedia.org/r/702026 (owner: 10CDanis) [02:47:17] !log ✔️ cdanis@cumin2001.codfw.wmnet ~ 🕥🍺 sudo cumin -b16 'A:cp-upload and A:codfw' 'run-puppet-agent -q' [02:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:47:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:47:47] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [03:12:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [03:12:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [03:21:27] (03PS1) 10Legoktm: upload: Fully block requests with missing or empty User-Agent [puppet] - 10https://gerrit.wikimedia.org/r/702027 [03:23:05] (03CR) 10CDanis: [C: 03+1] upload: Fully block requests with missing or empty User-Agent [puppet] - 10https://gerrit.wikimedia.org/r/702027 (owner: 10Legoktm) [03:23:14] (03CR) 10Cwhite: [C: 03+1] upload: Fully block requests with missing or empty User-Agent [puppet] - 10https://gerrit.wikimedia.org/r/702027 (owner: 10Legoktm) [03:23:25] (03PS2) 10Legoktm: upload: Fully block requests with missing or empty User-Agent [puppet] - 10https://gerrit.wikimedia.org/r/702027 [03:25:52] (03CR) 10Legoktm: [C: 03+2] upload: Fully block requests with missing or empty User-Agent [puppet] - 10https://gerrit.wikimedia.org/r/702027 (owner: 10Legoktm) [03:32:46] (Primary outbound port utilisation over 80% #page) 
resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [03:32:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [03:52:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:53:01] 10SRE, 10Traffic, 10User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (10colewhite) Changeset banning empty user agents: https://gerrit.wikimedia.org/r/702027 Result: https://grafana.wikimedia.org/d/000000503/varnish-http-errors?... [04:15:02] (03PS1) 10Tim Starling: Add statsd timing for actions [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702010 (https://phabricator.wikimedia.org/T284274) [04:15:12] (03CR) 10Tim Starling: [C: 03+2] Add statsd timing for actions [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702010 (https://phabricator.wikimedia.org/T284274) (owner: 10Tim Starling) [04:23:37] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:33:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:33:06] (03Merged) 10jenkins-bot: Add statsd timing for actions [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702010 (https://phabricator.wikimedia.org/T284274) (owner: 10Tim Starling) [04:33:15] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:42:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:42:59] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:15:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:19:51] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:21:33] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 220, down: 2, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:21:53] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.4154 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:26:09] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:27:29] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [05:27:59] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:28:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:29:31] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:33:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:39:05] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [05:39:13] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07692 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:49:56] (03CR) 10Jcrespo: [C: 03+2] Revert "dbbackups: Temporarily disable s4 snapshots to prevent conflict with dumps" [puppet] - 10https://gerrit.wikimedia.org/r/701721 (owner: 10Jcrespo) [05:51:02] (03CR) 10Jcrespo: [C: 03+2] Revert "dbbackups: Temporarily change backup schedules to fit better dc switch" [puppet] - 10https://gerrit.wikimedia.org/r/701717 (owner: 10Jcrespo) [06:05:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:16:55] 10SRE, 10serviceops, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10elukey) Just as reminder, mw1384 was [[ https://sal.toolforge.org/log/_2yPSHoBa_6PSCT9smu4 | dep... [06:28:49] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:31:57] <_joe_> Sigh happening again (high mw latency) [06:32:16] <_joe_> elukey: if the situation is not tragic, please wait for me to be around in a few [06:33:13] _joe_ it seems a little different this time, there was a big peak of latency that then auto-resolved [06:33:39] <_joe_> that's usually some backend slowness [06:38:19] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:42:41] I am seeing some correlation between that and db1136 having spikes [06:42:43] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [06:42:46] I am investigating [06:42:51] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.4308 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [06:43:25] db1136 is s7 master, and looks like it got some writes spikes [06:44:50] Ah no, never mind [06:46:18] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.11/includes/MediaWiki.php: Add statsd action timing metric T284274 (duration: 00m 58s) [06:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:28] T284274: action=history allows for limits as high as 5000, which is probably too high - https://phabricator.wikimedia.org/T284274 [06:48:31] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [06:54:02] I see some mw appservers with 0 idle workers, like 4 [06:54:09] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.04615 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [06:55:54] given https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 it seems something different from the past days, as Joe mentioned it may be a slow backend [07:05:01] !log upgrading bullseye early installs to the latest state of testing T275873 [07:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:09] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [07:10:57] PROBLEM - SSH on mw1296.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:13:31] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:19:55] uff [07:25:39] from the ATS point of view I don't see an increase in traffic towards appservers, Manuel checked and no db-slowness aligns, we end up with busy appservers for some reason [07:26:36] (also it seems that the busy appservers are not failing health checks, pybal logs clear afaics) [07:29:41] latency seems stabilizing again [07:32:38] (03CR) 10Filippo Giunchedi: [C: 03+1] varnish: add counters for Varnish SLI [puppet] - 10https://gerrit.wikimedia.org/r/701358 (https://phabricator.wikimedia.org/T284576) (owner: 10Ema) [07:35:39] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:38:00] anybody can help with --^ [07:38:47] the last latency increase is very minor compared to the other ones [07:38:51] should solve soon-ish [07:38:55] (if it doesn't get worse) [07:48:56] !log remove old /root/prometheus data from prometheus4001 [07:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:00] !log remove 20G migration data /root/prometheus from prometheus4001 - T243057 [07:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:06] T243057: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 [07:53:02] (03PS9) 10Giuseppe Lavagetto: mediawiki: Remove references to obsolete rpc/RunJobs.php endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575392 (https://phabricator.wikimedia.org/T243096) (owner: 10Aaron Schulz) [07:57:05]
PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 2763 MB (10% inode=94%): /tmp 2763 MB (10% inode=94%): /var/tmp 2763 MB (10% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [07:58:06] gehel, ryankemper --^ [08:02:59] !log Upgraded Jenkins on releases1002 / releases2002 [08:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:04] elukey: thanks! looking [08:03:07] !log Upgraded Jenkins on releases1002 / releases2002 # T285531 [08:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:13] T285531: Upgrade Jenkins to 2.289.x - https://phabricator.wikimedia.org/T285531 [08:05:39] looks like a disk disapeared on that server. No urgent concern, we have enough redundancy [08:07:03] gehel: ah okok, now I recall that I tagged you on an elastic raid issue reported to dcops, it must be the same node then [08:07:15] yep: T285643 [08:07:16] T285643: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T285643 [08:07:42] 10SRE, 10ops-eqiad, 10Discovery, 10Discovery-Search (Current work): Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T285643 (10Gehel) [08:07:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Remove references to obsolete rpc/RunJobs.php endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575392 (https://phabricator.wikimedia.org/T243096) (owner: 10Aaron Schulz) [08:12:11] !log Upgrading Jenkins on contint2001 / contint1001 and restarting CI Jenkins # T285531 [08:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:18] T285531: Upgrade Jenkins to 2.289.x - https://phabricator.wikimedia.org/T285531 [08:12:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:12:41] (03CR) 10Ema: [C: 03+1] "LGTM! Perhaps to err on the side of caution we could merge this with puppet disabled and test on one text and one upload node first though" [puppet] - 10https://gerrit.wikimedia.org/r/701073 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [08:14:30] ah lovely, on one of the busy appservers [08:14:31] PHP Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 67108872 bytes) [08:15:06] and /apcu-frag returns the error page [08:17:59] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [08:18:15] _joe_ (if you have a moment) - from a quick look it seems that some appservers end up into a weird state like the weekend, it may need a roll restart [08:19:15] <_joe_> elukey: gimme 1 sec, I'm rolling out a change right now [08:19:20] the slow log is interesting, it mentions a lot apcu-related stack traces for object cache [08:19:31] sure [08:19:41] <_joe_> elukey: can you leave one broken and restart the others? 
[08:20:20] _joe_ there is already mw1384 from sunday, I can leave another one as well, ok to set pooled=no on two appservers? (just to confirm) [08:20:27] <_joe_> yes [08:20:35] <_joe_> I wanted a fresh one, name one [08:21:09] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1355.eqiad.wmnet [08:21:14] there were 2 patches that were characterised as risky [08:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:36] <_joe_> so, it seems clear that what is happening is we moved even more stuff to use apcu [08:21:38] !log depool mw1355 (mw appserver) for debugging - T285634 [08:21:44] <_joe_> which isn't great tbh [08:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:45] T285634: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 [08:21:52] we can either bump it [08:21:57] or ask to rollback the patches [08:22:40] <_joe_> effie: which patches? [08:22:41] cumin 'A:mw-eqiad' '/usr/local/sbin/restart-php7.2-fpm' -b 2 -s 30 [08:22:44] does it look ok? [08:22:49] _joe_: https://phabricator.wikimedia.org/T281152 [08:23:52] sorry, 3 [08:24:02] I am reading to see if any of them could cause this [08:24:16] <_joe_> elukey: +1 [08:24:27] <_joe_> did anyone try to dump apcu on those servers? [08:25:21] _joe_ didn't try since I didn't know how, is it via php7adm? the /apcu-frag on the servers with memory errors returns the error page [08:25:24] I can do it [08:25:24] 10SRE, 10serviceops, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Joe) The problem seems to be quite clearly caused by excessive apcu locking. Let's review the sy... [08:25:43] !log cumin 'A:mw-eqiad' '/usr/local/sbin/restart-php7.2-fpm' -b 2 -s 30 - T285634 [08:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:57] <_joe_> elukey: apparently even /apcu-meta can't work [08:27:30] it only affects appservers too [08:28:42] 10SRE, 10serviceops, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Joe) Also: the memory gets exhausted by this operation: ` $sma_info = apcu_sma_info(); ` this m... [08:34:35] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:35:02] more or less half way through restarts [08:35:37] 10SRE, 10serviceops, 10Release, 10Train Deployments, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Joe) p:05High→03Unbreak! The recurring problem seems to...
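[Editor's note] Joe's task comment just above quotes `$sma_info = apcu_sma_info();` as the call that exhausts memory: without arguments it enumerates every free block in every APCu shared-memory segment, and the fatal quoted at 08:14:31 shows the request hitting its 524288000-byte limit (exactly 500 MiB, i.e. 500 * 1024 * 1024), which lines up with why `/apcu-frag` was returning the error page. A minimal sketch of the cheaper "limited" introspection path follows, written as a standalone debug script; it is an illustration, not the actual php7adm endpoint.

```php
<?php
// Minimal sketch (not the production php7adm endpoint): read APCu health
// numbers in "limited" mode. Passing true to both calls skips the
// per-segment free-block lists and the full entry list, which are the
// parts that can exhaust a 500 MiB (500 * 1024 * 1024 = 524288000 bytes)
// request memory_limit on an already-busy server.

$sma   = apcu_sma_info( true );    // segments, segment size, free bytes
$cache = apcu_cache_info( true );  // hits, misses, entry count, expunges

$total = $sma['num_seg'] * $sma['seg_size'];
$used  = $total - $sma['avail_mem'];

printf( "apcu shm: %d segment(s) x %d bytes, %.1f%% used\n",
	$sma['num_seg'], $sma['seg_size'], 100 * $used / $total );
printf( "entries: %d  hits: %d  misses: %d  expunges: %d\n",
	$cache['num_entries'], $cache['num_hits'],
	$cache['num_misses'], $cache['expunges'] );
```

The unlimited `apcu_sma_info()` additionally returns a `block_lists` array describing each free block per segment, which is what a fragmentation report needs but also the part worth avoiding on a host that is already short on memory.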
[08:36:38] 10SRE, 10serviceops, 10Release, 10Train Deployments, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Joe) [08:38:10] <_joe_> elukey: we can repool those two servers as far as I'm concerned [08:38:39] <_joe_> also AIUI you restarted all of them anyways [08:39:42] _joe_ yes my mistake, didn't think about it [08:39:49] I'll repool them [08:39:52] <_joe_> that's ok, don't worry [08:39:59] <_joe_> we'll have plenty cases by tonight [08:43:56] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=elastic1039.eqiad.wmnet [08:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:37] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: BGP Policy on aggregate routes prevents them being created in some circumstances. - https://phabricator.wikimedia.org/T283163 (10cmooney) 05Open→03Resolved [08:46:37] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1355.eqiad.wmnet [08:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:44] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1384.eqiad.wmnet [08:46:46] 10SRE, 10serviceops, 10Release, 10Train Deployments, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10tstarling) >>! In T285634#7181000, @Legoktm wrote: >> and al... [08:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:57] all right all good, restarts completed and debug hosts repooled [08:47:21] !log repool mw13[55,84] after debugging - T285634 [08:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:28] T285634: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 [09:07:12] (03Abandoned) 10Phuedx: WIP: vector: Disable highlighting query in search autocomplete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699416 (https://phabricator.wikimedia.org/T281797) (owner: 10Phuedx) [09:12:33] RECOVERY - SSH on mw1296.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:13:03] 10SRE, 10serviceops, 10Release, 10Train Deployments, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10jijiki) I don't know if it helps, the increasing numbers of... [09:22:03] 10SRE, 10serviceops, 10Release, 10Train Deployments, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Joe) @jijiki I think this correlation is another hint that w... [09:24:32] 10SRE, 10serviceops, 10Release, 10Train Deployments, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Addshore) I have a suspicion that this Wikibase cache is rel... 
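[Editor's note] The "Wikibase cache" Addshore suspects above is presumably the one graphed later in the log (the wikibase-statsdrecordingsimplecache Grafana dashboard linked at 11:41). As a generic illustration of that kind of instrumented cache, where every get/set bumps a counter so a "gets per second" panel can exist at all, here is a small decorator sketch; the interface and class names are placeholders, not Wikibase's actual classes.

```php
<?php
// Generic sketch of a counting cache decorator, the pattern behind a
// "cache gets per second" dashboard: each read/write increments a metric
// and then delegates to the wrapped cache. Names are illustrative only.

interface SimpleCache {
	public function get( $key, $default = null );
	public function set( $key, $value, $ttl = 0 );
}

class CountingCache implements SimpleCache {
	/** @var SimpleCache */
	private $inner;
	/** @var callable function ( string $metricName ): void, e.g. a statsd increment */
	private $increment;

	public function __construct( SimpleCache $inner, callable $increment ) {
		$this->inner = $inner;
		$this->increment = $increment;
	}

	public function get( $key, $default = null ) {
		call_user_func( $this->increment, 'cache.get' );
		return $this->inner->get( $key, $default );
	}

	public function set( $key, $value, $ttl = 0 ) {
		call_user_func( $this->increment, 'cache.set' );
		return $this->inner->set( $key, $value, $ttl );
	}
}
```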
[09:25:55] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech, and 3 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Addshore) [09:27:06] !log installing nettle security updates on buster [09:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:22] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/701931 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [09:28:22] (03CR) 10Elukey: [C: 03+1] logstash: transition aqs logs to ECS [puppet] - 10https://gerrit.wikimedia.org/r/701617 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [09:29:37] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech, and 5 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Ladsgroup) a:03Ladsgroup On to find and revert/fix the culprit [09:30:02] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech, and 5 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Joe) I think what @Addshore just found is a good candidate for being the sour... [09:33:10] (03PS1) 10David Caro: wmcs.vps.refresh_puppet_certs: better handle puppetmaster swap [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702082 (https://phabricator.wikimedia.org/T274498) [09:33:12] (03PS1) 10David Caro: wmcs: ran black and isort [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702083 [09:33:14] (03PS1) 10David Caro: wmcs: add default control node to openstack api [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702084 [09:33:16] (03PS1) 10David Caro: wmcs: namespace exceptions [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702085 (https://phabricator.wikimedia.org/T274498) [09:33:18] (03PS1) 10David Caro: wmcs: quote some parameters to openstack [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702086 (https://phabricator.wikimedia.org/T274498) [09:33:20] (03PS1) 10David Caro: wmcs.OpenstackApi: allow soft affinities to be specified [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702087 (https://phabricator.wikimedia.org/T274498) [09:33:22] (03PS1) 10David Caro: wmcs.start_instance_with_prefix: allow passing the affinity [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702088 (https://phabricator.wikimedia.org/T274498) [09:33:24] (03PS1) 10David Caro: wmcs: add kubernetes and kubeadm controllers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702089 (https://phabricator.wikimedia.org/T274498) [09:33:26] (03PS1) 10David Caro: wmcs.toolforge: add k8s worker add/remove cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702090 (https://phabricator.wikimedia.org/T274498) [09:33:28] (03PS1) 10David Caro: wmcs.toolforge: add task-id to k8s worker cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702091 (https://phabricator.wikimedia.org/T274498) [09:34:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/701617 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [09:34:36] (03CR) 10Vgutierrez: [C: 03+2] vcl: Use VCL 4.1 instead of 4.0 [puppet] - 10https://gerrit.wikimedia.org/r/701073 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [09:40:30] (03PS2) 10David Caro: 
wmcs.vps.refresh_puppet_certs: better handle puppetmaster swap [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702082 (https://phabricator.wikimedia.org/T274498) [09:40:32] (03PS2) 10David Caro: wmcs: ran black and isort [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702083 [09:40:34] (03PS2) 10David Caro: wmcs: add default control node to openstack api [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702084 [09:40:36] (03PS2) 10David Caro: wmcs: namespace exceptions [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702085 (https://phabricator.wikimedia.org/T274498) [09:40:38] (03PS2) 10David Caro: wmcs: quote some parameters to openstack [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702086 (https://phabricator.wikimedia.org/T274498) [09:40:40] (03PS2) 10David Caro: wmcs.OpenstackApi: allow soft affinities to be specified [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702087 (https://phabricator.wikimedia.org/T274498) [09:40:42] (03PS2) 10David Caro: wmcs.start_instance_with_prefix: allow passing the affinity [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702088 (https://phabricator.wikimedia.org/T274498) [09:40:44] (03PS2) 10David Caro: wmcs: add kubernetes and kubeadm controllers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702089 (https://phabricator.wikimedia.org/T274498) [09:40:46] (03PS2) 10David Caro: wmcs.toolforge: add k8s worker add/remove cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702090 (https://phabricator.wikimedia.org/T274498) [09:40:48] (03PS2) 10David Caro: wmcs.toolforge: add task-id to k8s worker cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702091 (https://phabricator.wikimedia.org/T274498) [09:46:18] (03CR) 10Filippo Giunchedi: Move RPKI alerts to Prometheus/AM (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [09:49:37] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech, and 5 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Joe) Scavenging the production logs, we found that `Special:EntityData` reque... 
[09:50:05] (03CR) 10Vgutierrez: [C: 03+1] Switch to nginx-light on all acmechief servers [puppet] - 10https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [09:50:39] (03PS3) 10David Caro: toolforge.genpp: add buster repos [puppet] - 10https://gerrit.wikimedia.org/r/701062 (https://phabricator.wikimedia.org/T277653) [09:50:43] (03CR) 10David Caro: toolforge.genpp: add buster repos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/701062 (https://phabricator.wikimedia.org/T277653) (owner: 10David Caro) [09:50:45] (03PS6) 10David Caro: toolforge: Add buster specific packages/setting [puppet] - 10https://gerrit.wikimedia.org/r/700186 (https://phabricator.wikimedia.org/T277653) [09:50:46] (03PS5) 10David Caro: toolforge.exec_environ: add tests [puppet] - 10https://gerrit.wikimedia.org/r/701063 (https://phabricator.wikimedia.org/T277653) [09:50:56] (03CR) 10David Caro: toolforge: Add buster specific packages/setting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/700186 (https://phabricator.wikimedia.org/T277653) (owner: 10David Caro) [09:51:09] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/701063 (https://phabricator.wikimedia.org/T277653) (owner: 10David Caro) [09:51:25] (03PS4) 10Effie Mouzeli: tegola-vector-tiles: add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/701138 (https://phabricator.wikimedia.org/T283159) [09:52:07] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10jbond) Thanks for all the responses > The reality is that we create th... [09:58:26] (03CR) 10Muehlenhoff: [C: 03+2] Switch to nginx-light on all acmechief servers [puppet] - 10https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:17:38] (03CR) 10Muehlenhoff: [C: 03+2] role::docker_registry_ha::registry: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698800 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:20:00] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech, and 5 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10daniel) > Scavenging the production logs, we found that Special:EntityData re... [10:20:17] (03CR) 10David Caro: [C: 03+2] toolforge.exec_environ: add tests [puppet] - 10https://gerrit.wikimedia.org/r/701063 (https://phabricator.wikimedia.org/T277653) (owner: 10David Caro) [10:20:21] (03CR) 10David Caro: [C: 03+2] toolforge.genpp: add buster repos [puppet] - 10https://gerrit.wikimedia.org/r/701062 (https://phabricator.wikimedia.org/T277653) (owner: 10David Caro) [10:21:00] (03CR) 10David Caro: [C: 03+2] toolforge: Add buster specific packages/setting [puppet] - 10https://gerrit.wikimedia.org/r/700186 (https://phabricator.wikimedia.org/T277653) (owner: 10David Caro) [10:24:09] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech, and 5 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Joe) >>! In T285634#7183188, @daniel wrote: >> Scavenging the production logs... 
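[Editor's note] The patches backported later in the log (gerrit 702018/702019, "Use EntityLookup backed TermLookup for Rdf PropertyStubs") appear to be what brings the APCu read rate back down. As a generic illustration of why a hot code path that reads APCu for every property hurts, and of one common mitigation (memoizing in a per-request PHP array in front of `apcu_fetch()`), here is a sketch; the function name, key scheme and TTL are hypothetical, and this is not the actual Wikibase change.

```php
<?php
// Illustration only: collapse repeated APCu reads for the same key into at
// most one read per request by memoizing in a plain PHP array. APCu reads
// hit shared memory and, at very high rates, contend on locks; a
// request-local array does not.

function lookupPropertyLabel( string $propertyId, callable $loadFromDb ): string {
	static $memo = [];                  // lives for the duration of the request

	if ( array_key_exists( $propertyId, $memo ) ) {
		return $memo[$propertyId];      // no APCu hit, no lock contention
	}

	$key = 'example:property-label:' . $propertyId;    // hypothetical key scheme
	$label = apcu_fetch( $key, $found );
	if ( !$found ) {
		$label = $loadFromDb( $propertyId );
		apcu_store( $key, $label, 3600 );               // share across workers
	}

	return $memo[$propertyId] = $label;
}
```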
[10:29:13] (03PS3) 10Effie Mouzeli: tegola: add caching support [deployment-charts] - 10https://gerrit.wikimedia.org/r/701369 (owner: 10Jgiannelos) [10:30:20] !log cleanup now unused nginx mods and former deps (various X11 libs and libxslt) on acmechief* after switch towards nginx-light T164456 [10:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:28] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 [10:32:09] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:38:13] (03PS1) 10Phuedx: vector: Finish enabling language switcher treatment A/B test on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702095 (https://phabricator.wikimedia.org/T269093) [10:39:42] (03PS2) 10Muehlenhoff: Switch docker registry to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698803 (https://phabricator.wikimedia.org/T164456) [10:53:43] (03PS1) 10Hnowlan: maps: reimage maps2008 as buster replica in new cluster [puppet] - 10https://gerrit.wikimedia.org/r/702099 [10:59:30] (03PS1) 10Muehlenhoff: conf: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/702101 (https://phabricator.wikimedia.org/T164456) [11:00:00] (03CR) 10jerkins-bot: [V: 04-1] conf: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/702101 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European mid-day backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210629T1100). [11:00:05] Lucas_WMDE, phuedx, and Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] o/ [11:00:15] o/ [11:00:25] Lucas_WMDE: you must be a lucky developer. You're mentioned here three times! [11:00:33] what’s the current status? can we deploy or is that UBN still going on? 
[11:00:36] urbanecm: :P [11:00:38] <_joe_> hey everyone, please hold [11:00:50] <_joe_> it's still ongoing and Amir1 is testing stuff on mwdebug1001 [11:00:53] ok [11:01:12] I tested it, it sure fixes the issue https://performance.wikimedia.org/xhgui/run/view?id=60dafbf574d44f5875da7b39 [11:01:31] but ci needs passing and then backport and deploy [11:02:58] (03PS1) 10Hnowlan: maps: make maps1008 a buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702102 (https://phabricator.wikimedia.org/T269582) [11:04:20] (03PS1) 10Ladsgroup: Use EntityLookup backed TermLookup for Rdf PropertyStubs [extensions/Wikibase] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702018 (https://phabricator.wikimedia.org/T285634) [11:04:29] (03PS2) 10Muehlenhoff: conf: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/702101 (https://phabricator.wikimedia.org/T164456) [11:04:44] (03CR) 10Ladsgroup: [C: 03+2] Use EntityLookup backed TermLookup for Rdf PropertyStubs [extensions/Wikibase] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702018 (https://phabricator.wikimedia.org/T285634) (owner: 10Ladsgroup) [11:05:01] (03PS1) 10David Caro: toolforge.exec_environ: use libnode-dev on buster [puppet] - 10https://gerrit.wikimedia.org/r/702103 (https://phabricator.wikimedia.org/T277653) [11:05:11] does anyone know wmf.12 is cut or not? [11:05:26] it is, I think [11:05:30] (03PS1) 10Ladsgroup: Use EntityLookup backed TermLookup for Rdf PropertyStubs [extensions/Wikibase] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702019 (https://phabricator.wikimedia.org/T285634) [11:05:32] (03CR) 10David Caro: [C: 03+2] toolforge.exec_environ: use libnode-dev on buster [puppet] - 10https://gerrit.wikimedia.org/r/702103 (https://phabricator.wikimedia.org/T277653) (owner: 10David Caro) [11:05:33] I just ran `git fetch` in my Wikibase.git and a wmf.12 appeared [11:05:34] yeah [11:05:40] (03CR) 10Ladsgroup: [C: 03+2] Use EntityLookup backed TermLookup for Rdf PropertyStubs [extensions/Wikibase] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702019 (https://phabricator.wikimedia.org/T285634) (owner: 10Ladsgroup) [11:05:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702101 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [11:16:04] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech, and 6 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Tarrow) >>! In T285634#7183188, @daniel wrote: > > Did the code change, or i... [11:16:49] (03PS2) 10Tarrow: Use EntityLookup backed TermLookup for Rdf PropertyStubs [extensions/Wikibase] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702019 (https://phabricator.wikimedia.org/T285634) (owner: 10Ladsgroup) [11:17:10] ah so backport it is then [11:19:15] (03PS1) 10Muehlenhoff: relforge::Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/702106 (https://phabricator.wikimedia.org/T164456) [11:20:14] apergos: already done, halfway through being merged now :D [11:21:04] those 40 minute merges! [11:22:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702106 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [11:23:46] yeah... 
[11:25:28] selenium really should get faster, this not sustainable [11:29:41] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:33:02] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30022/console" [puppet] - 10https://gerrit.wikimedia.org/r/702102 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [11:33:21] (03CR) 10jerkins-bot: [V: 04-1] Use EntityLookup backed TermLookup for Rdf PropertyStubs [extensions/Wikibase] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702018 (https://phabricator.wikimedia.org/T285634) (owner: 10Ladsgroup) [11:33:25] (03PS1) 10Vgutierrez: varnish: Avoid requiring virtualbox on Vagrant based tests [puppet] - 10https://gerrit.wikimedia.org/r/702108 [11:33:31] (03Merged) 10jenkins-bot: Use EntityLookup backed TermLookup for Rdf PropertyStubs [extensions/Wikibase] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702018 (https://phabricator.wikimedia.org/T285634) (owner: 10Ladsgroup) [11:33:38] orilly [11:34:45] probably sync PropertyStubRdfBuilder.php first? [11:35:10] !log ladsgroup@deploy1002 sync-file aborted: Backport: [[gerrit:702018|Use EntityLookup backed TermLookup for Rdf PropertyStubs (T285634)]] (duration: 00m 10s) [11:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:17] yeah, let's do that [11:35:19] T285634: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 [11:36:47] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/Wikibase/repo/includes/Rdf/PropertyStubRdfBuilder.php: Backport: [[gerrit:702018|Use EntityLookup backed TermLookup for Rdf PropertyStubs (T285634)]], Part I (duration: 00m 56s) [11:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:03] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/Wikibase/repo/: Backport: [[gerrit:702018|Use EntityLookup backed TermLookup for Rdf PropertyStubs (T285634)]], Part II (duration: 00m 58s) [11:38:08] (03PS1) 10Muehlenhoff: relforge: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702109 (https://phabricator.wikimedia.org/T164456) [11:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702109 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [11:41:04] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30023/console" [puppet] - 10https://gerrit.wikimedia.org/r/702099 (owner: 10Hnowlan) [11:41:05] this is back to almost zero https://grafana.wikimedia.org/d/u5wAugyik/wikibase-statsdrecordingsimplecache?viewPanel=12&orgId=1&from=now-3h&to=now [11:41:51] i guess that's good news [11:43:28] _joe_: I think we are done here [11:43:46] waiting for wmf.12 to merge and backport [11:44:29] those numbers look great indeed [11:44:58] was .12 already synced to the cluster? if not, just merging it is enough afaik [11:45:00] (03CR) 10Ladsgroup: [C: 03+2] "hmm, it doesn't look like it got triggered." 
[extensions/Wikibase] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702019 (https://phabricator.wikimedia.org/T285634) (owner: 10Ladsgroup) [11:45:05] I’ll postpone my config changes, since I’m in a meeting at the moment [11:45:24] maybe phuedx’ change could still happen? [11:45:56] majavah: it needs a rebase on deploy1002 anyway [11:46:23] the gate and submit wasn't triggered [11:46:49] If everything is OK and there is enough time, I would appreciate it [11:46:54] that's going to take 25 minutes or so [11:47:00] phuedx: sure, go ahead [11:48:12] On it [11:48:16] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech, and 6 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Ladsgroup) This is basically done, we just need to wait to see if it continue... [11:48:23] (03CR) 10Phuedx: [C: 03+2] vector: Finish enabling language switcher treatment A/B test on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702095 (https://phabricator.wikimedia.org/T269093) (owner: 10Phuedx) [11:49:09] (03Merged) 10jenkins-bot: vector: Finish enabling language switcher treatment A/B test on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702095 (https://phabricator.wikimedia.org/T269093) (owner: 10Phuedx) [11:49:27] Pulling the above onto mwdebug1001 and testing [11:50:44] (03PS1) 10Filippo Giunchedi: mtail: use non-deprecated log.warning [puppet] - 10https://gerrit.wikimedia.org/r/702110 (https://phabricator.wikimedia.org/T285534) [11:51:30] (03PS1) 10Muehlenhoff: cloudelastic: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702111 (https://phabricator.wikimedia.org/T164456) [11:52:38] (03CR) 10DCausse: [C: 03+1] relforge: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702109 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [11:52:56] LGTM. 
Syncing [11:53:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702111 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [11:54:18] !log phuedx@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:702095|vector: Finish enabling language switcher treatment A/B test on fawiki (T269093)]] (duration: 00m 56s) [11:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:29] T269093: Deploy new language switching location to test wikis and begin A/B test pt 1 - https://phabricator.wikimedia.org/T269093 [11:55:29] Done [11:59:34] (03PS1) 10Muehlenhoff: swift proxies: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702113 (https://phabricator.wikimedia.org/T164456) [12:00:06] (03CR) 10jerkins-bot: [V: 04-1] swift proxies: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702113 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [12:01:42] (03PS2) 10Muehlenhoff: swift proxies: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702113 (https://phabricator.wikimedia.org/T164456) [12:01:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702113 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [12:05:01] (03Merged) 10jenkins-bot: Use EntityLookup backed TermLookup for Rdf PropertyStubs [extensions/Wikibase] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702019 (https://phabricator.wikimedia.org/T285634) (owner: 10Ladsgroup) [12:15:38] (03PS1) 10Muehlenhoff: maps: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702114 (https://phabricator.wikimedia.org/T164456) [12:20:59] (03CR) 10Ema: [C: 03+1] varnish: Avoid requiring virtualbox on Vagrant based tests [puppet] - 10https://gerrit.wikimedia.org/r/702108 (owner: 10Vgutierrez) [12:23:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702114 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [12:24:40] (03PS1) 10Filippo Giunchedi: mtail: parse 3.0.0~rc43 store format [puppet] - 10https://gerrit.wikimedia.org/r/702116 (https://phabricator.wikimedia.org/T285534) [12:30:26] (03CR) 10Ema: "LGTM but one nit." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702116 (https://phabricator.wikimedia.org/T285534) (owner: 10Filippo Giunchedi) [12:30:57] (03CR) 10Ema: [C: 03+1] mtail: use non-deprecated log.warning [puppet] - 10https://gerrit.wikimedia.org/r/702110 (https://phabricator.wikimedia.org/T285534) (owner: 10Filippo Giunchedi) [12:31:31] (03PS1) 10Muehlenhoff: conf: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702117 (https://phabricator.wikimedia.org/T164456) [12:33:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702117 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [12:34:25] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "just like we did on other services and https://puppet-compiler.wmflabs.org/compiler1001/30025/registry1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/698803 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [12:50:09] (03CR) 10Vgutierrez: [C: 03+2] varnish: Avoid requiring virtualbox on Vagrant based tests [puppet] - 10https://gerrit.wikimedia.org/r/702108 (owner: 10Vgutierrez) [12:55:11] (03PS2) 10Muehlenhoff: conf: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702117 (https://phabricator.wikimedia.org/T164456) [12:59:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702117 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:18:29] (03PS1) 10Dzahn: gitlab: move service IP settings from common to DC level [puppet] - 10https://gerrit.wikimedia.org/r/702123 (https://phabricator.wikimedia.org/T285456) [13:18:59] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frdev1002 - https://phabricator.wikimedia.org/T282054 (10Jgreen) >>! In T282054#7181417, @Cmjohnson wrote: > @Jgreen the idrac is set up, the password is set to the temporary DM if you do not remember. The production port is not set up. Which VL... [13:19:46] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech, and 6 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Joe) 05Open→03Resolved Data on the number of apcu gets/s normalized after... 
[13:29:40] (03PS1) 10Dzahn: gitlab: add parameter for active_host, limit backups to it [puppet] - 10https://gerrit.wikimedia.org/r/702126 (https://phabricator.wikimedia.org/T285456) [13:30:41] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:31:18] !log otto@deploy1002 Started deploy [analytics/refinery@edc31a2]: Regular analytics weekly train [analytics/refinery@COMMIT_HASH] [13:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:27] (03PS2) 10Dzahn: gitlab: add parameter for active_host, limit backups to it [puppet] - 10https://gerrit.wikimedia.org/r/702126 (https://phabricator.wikimedia.org/T285456) [13:34:11] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:34:53] !log volker-e@deploy1002 Started deploy [design/style-guide@e97fccb]: Deploy design/style-guide: e97fccb styles: Add internationalization and accessibility note labels and treatments (#476) [13:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:00] !log volker-e@deploy1002 Finished deploy [design/style-guide@e97fccb]: Deploy design/style-guide: e97fccb styles: Add internationalization and accessibility note labels and treatments (#476) (duration: 00m 07s) [13:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:37] 10SRE, 10GitLab, 10serviceops, 10vm-requests, 10Patch-For-Review: codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (10Dzahn) Next we need to reserve a service IP for this new host in netbox and after the changes above are merged we can then add that new service IP to Hiera... [13:35:42] (03CR) 10Jgiannelos: [C: 03+1] maps: fix osm sync directory path [puppet] - 10https://gerrit.wikimedia.org/r/701558 (owner: 10MSantos) [13:38:07] (03PS1) 10Marostegui: wmnet: Change masters cnames [dns] - 10https://gerrit.wikimedia.org/r/702128 (https://phabricator.wikimedia.org/T281515) [13:38:21] (03PS1) 10Ottomata: camus events - change which DC topic is used for Hadoop ingestion alerts [puppet] - 10https://gerrit.wikimedia.org/r/702129 (https://phabricator.wikimedia.org/T266798) [13:39:22] (03PS2) 10Filippo Giunchedi: mtail: parse 3.0.0~rc43 store format [puppet] - 10https://gerrit.wikimedia.org/r/702116 (https://phabricator.wikimedia.org/T285534) [13:39:34] (03CR) 10Filippo Giunchedi: mtail: parse 3.0.0~rc43 store format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702116 (https://phabricator.wikimedia.org/T285534) (owner: 10Filippo Giunchedi) [13:39:42] <_joe_> godog: did they change the format *again*? [13:40:29] _joe_: haha no not really, the latest version actually DTRT and just prints json on stdout when asked [13:40:33] (03CR) 10Dzahn: [V: 04-1] "Error while evaluating a Resource Statement, Class[Tilerator]: has no parameter named 'osm_dir'" [puppet] - 10https://gerrit.wikimedia.org/r/701558 (owner: 10MSantos) [13:40:35] without the "metric store:" prefix [13:40:48] <_joe_> yeah still, I love their versioning [13:41:18] yeah it is amusing at this point [13:41:34] (03CR) 10Ottomata: "Re all the files and merged_hash; we're encountering this problem for user groups too, I'd like to revisit this problem in general." 
[puppet] - 10https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [13:41:47] (03CR) 10Kormat: [C: 03+1] wmnet: Change masters cnames [dns] - 10https://gerrit.wikimedia.org/r/702128 (https://phabricator.wikimedia.org/T281515) (owner: 10Marostegui) [13:42:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/702113 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:43:00] (03CR) 10Ottomata: [C: 03+1] logstash: transition aqs logs to ECS [puppet] - 10https://gerrit.wikimedia.org/r/701617 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [13:45:31] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech, and 6 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Jdforrester-WMF) [13:47:35] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10Marostegui) The banner on `eswiki` has an error on the maintenance times: it mentions 05:00 UTC - 05:30 UTC [13:47:51] 10SRE, 10SRE-Access-Requests: Requesting access to analytics cluster for btullis - https://phabricator.wikimedia.org/T285754 (10BTullis) Should I create a separate request for the LDAP group change? I see from [[ https://wikitech.wikimedia.org/wiki/Analytics/Team/Onboarding#LDAP | here ]] that there is a speci... [13:48:10] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for btullis - https://phabricator.wikimedia.org/T285754 (10BTullis) [13:48:21] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10Marostegui) Looks like it is everywhere, not only `eswiki`. [13:48:35] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:49:00] !log otto@deploy1002 Finished deploy [analytics/refinery@edc31a2]: Regular analytics weekly train [analytics/refinery@COMMIT_HASH] (duration: 17m 42s) [13:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:24] I did setup the switchdc tmux session on cumin1001. Please join with `sudo -i tmux attach -rt switchdc` with a decent terminal size! 
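For anyone joining the shared session, a minimal sketch of checking the local terminal first and attaching read-only (the session name is taken from the message above; everything else is plain tmux and coreutils):

  # The shared session shrinks to the smallest attached client, so check
  # your terminal is at least the 165x41 mentioned a bit further down.
  stty size                          # prints "rows cols"

  # Attach read-only (-r) so stray keypresses cannot run anything;
  # detach again with the usual tmux prefix, Ctrl-b then d.
  sudo -i tmux attach -rt switchdc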
[13:49:28] !log otto@deploy1002 Started deploy [analytics/refinery@edc31a2] (thin): Regular analytics weekly train THIN [analytics/refinery@edc31a2] [13:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:35] !log otto@deploy1002 Finished deploy [analytics/refinery@edc31a2] (thin): Regular analytics weekly train THIN [analytics/refinery@edc31a2] (duration: 00m 07s) [13:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:53] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/702116 (https://phabricator.wikimedia.org/T285534) (owner: 10Filippo Giunchedi) [13:50:14] (03CR) 10Cwhite: [C: 03+1] mtail: use non-deprecated log.warning [puppet] - 10https://gerrit.wikimedia.org/r/702110 (https://phabricator.wikimedia.org/T285534) (owner: 10Filippo Giunchedi) [13:50:28] <_joe_> ottomata: we're about to start the switchover, it would be appreciated if we didn't release anything in the meantime [13:50:29] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:51:12] !log otto@deploy1002 Started deploy [analytics/refinery@edc31a2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@edc31a2] [13:51:14] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10SHB2000) It's come on enwikivoyage as well. Mentions maintenance times: 05:00 UTC - 05:30 UTC but it's only around 14:00 at UTC [13:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:02] (03PS1) 10Muehlenhoff: Convert sretest-logout.py to wmflib.idm [puppet] - 10https://gerrit.wikimedia.org/r/702133 [13:52:11] jayme: thank you! what's the acceptable minimum terminal size? :D [13:53:25] (03CR) 10jerkins-bot: [V: 04-1] Convert sretest-logout.py to wmflib.idm [puppet] - 10https://gerrit.wikimedia.org/r/702133 (owner: 10Muehlenhoff) [13:53:35] I would like to have it not smaller than the 165x41 it is now :D [13:53:53] +1 [13:53:56] (03PS1) 10Jbond: postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) [13:54:35] (03CR) 10jerkins-bot: [V: 04-1] postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [13:54:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30027/console" [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [13:55:22] (03PS2) 10Muehlenhoff: Convert sretest-logout.py to wmflib.idm [puppet] - 10https://gerrit.wikimedia.org/r/702133 [13:55:26] <_joe_> I think we can start with the non-critical phases whenever you feel like legoktm [13:55:39] notes for stuff that doesn't go right: https://etherpad.wikimedia.org/p/2021-switchdc-notes [13:55:47] Good luck all [13:56:36] <_joe_> wut [13:56:40] hrhr [13:56:41] <_joe_> like, now? 
[13:56:41] (03CR) 10jerkins-bot: [V: 04-1] Convert sretest-logout.py to wmflib.idm [puppet] - 10https://gerrit.wikimedia.org/r/702133 (owner: 10Muehlenhoff) [13:56:58] he's trolling us [13:57:03] <_joe_> definitely [13:57:11] <_joe_> the ordering is correct legoktm [13:57:13] args lgtm [13:57:21] jayme is typing, right? [13:57:23] +1 [13:57:24] yes [13:57:28] rzl: yes [13:57:33] o/ [13:57:46] i'll try to restart the bot.. oh too late [13:58:04] (03PS6) 10Vgutierrez: varnish: Add listen on UDS support [puppet] - 10https://gerrit.wikimedia.org/r/701056 (https://phabricator.wikimedia.org/T285374) [13:58:22] same game as yesterday. I'll wait a bit before every step for you all to yell and stop me from pressing enter [13:58:52] terminal size please [13:58:57] <_joe_> yes, please [13:59:05] <_joe_> whoever just joined, quit. [13:59:26] <_joe_> jayme: please just stop before step 02 [13:59:27] (03CR) 10jerkins-bot: [V: 04-1] varnish: Add listen on UDS support [puppet] - 10https://gerrit.wikimedia.org/r/701056 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [13:59:28] * mutante quit [13:59:32] (and I will wait whenever someone downsizes the terminal :P) [13:59:41] jayme: I think you're ready to go through steps 00 and 01, we have to wait 5 minutes for DNS TTL anyways [13:59:43] _joe_: sure [13:59:52] ack. Starting now [13:59:52] mutante: you're fine to rejoin, just enlarge your term first please -- the shared session takes the size of the smallest connected window [13:59:57] <_joe_> no, in the 02-07 phase we want to go on as soon as we get a go [14:00:05] legoktm and jayme: Dear deployers, time to do the Datacenter Switchover: MediaWiki deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210629T1400). [14:00:50] _joe_: ack to that. set-readonly to set-readwrite we should do fast (as rzl did last time IIRC) [14:00:57] yep [14:00:58] <_joe_> yes, as fast as sensible [14:01:08] stop if you see errors or hear shouting, proceed otherwise [14:01:25] fingers crossed :) [14:01:30] ok. disabling puppet now [14:01:31] good luck :) [14:01:34] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [14:01:37] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [14:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:56] reminder that 00-reduce-ttl will sleep five minutes now [14:02:01] so it'll ruin the party atmosphere for a little [14:02:08] terminal size please [14:02:10] I didn't add a "Blame Joe." line but you can imagine your own [14:02:11] <_joe_> rzl: it doesn't IIRC [14:02:18] <_joe_> oh you added the wait back? [14:02:20] it does as of Friday [14:02:25] yeah, because the warmup is faster now [14:02:29] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10sgrabarczuk) Fixed right after the first comment about eswiki was sent. Thank you! [14:02:32] <_joe_> ah right [14:02:35] <_joe_> let's proceed then [14:02:40] let's explore that :) [14:02:42] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [14:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:08] <_joe_> why is this still waiting just 3 seconds? 
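One way to confirm the TTL reduction actually landed is to query a discovery record directly and look at the remaining TTL; the record name below is only an illustrative guess, not necessarily one the cookbook touches:

  # The second field of each answer line is the TTL in seconds; after
  # 00-reduce-ttl it should be small (e.g. 10) rather than the normal value.
  dig +noall +answer appservers-rw.discovery.wmnet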
[14:03:11] <_joe_> I thought we fixed that [14:03:31] rzl: thanks, just tail -f /var/log/spicerack/sre/switchdc-extended.log works and can't mess with it [14:03:42] mutante: cheer [14:03:44] *cheers [14:03:50] * legoktm adds to notes [14:04:50] _joe_: yet another moment for you to stream some elevator music [14:05:24] <_joe_> jayme: I have some nice music right now going on in my studio, we should meet on kumospace so I can stream to yall [14:05:33] kumospace switchover!! [14:05:50] can you put a big red button in kumospace? [14:05:52] me too, I was keeping the party atmosphere going in my room [14:06:00] you walk over from room eqiad to room codfw and the music volume slowly changes [14:06:12] <_joe_> rzl: the added bonus is you can see my facial expressions if something goes slightly not according to plan [14:06:19] lol [14:06:24] yeah it's good monitoring [14:06:36] oh speaking of, I'm putting up http://listen.hatnote.com/ in the background again :) [14:06:58] <_joe_> yeah good idea [14:07:22] don't forget to pick a few languages for coverage, it defaults to just enwiki [14:07:35] are we on cumin1001 again? [14:07:42] apergos: yes [14:07:43] yep [14:08:12] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [14:08:15] still in the ttl window? [14:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:17] ah finally [14:09:02] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [14:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:10] jayme: lgtm [14:09:20] <_joe_> jayme: go [14:09:34] ha, they're already warmish from yesterday :) good [14:10:09] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [14:10:13] <_joe_> not really, why is load.php taking so long? [14:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:39] I didn't expect it to *stay* there, I thought it would speed up a bit from there [14:10:43] <_joe_> I'd run it again tbh [14:10:47] we were looking at more like 30 sec on a truly cold cache [14:10:56] yeah, I'd rerun too but I'm not sure it's going to help [14:11:08] <_joe_> just to be surer :P [14:11:15] okay, rerunning 00-warmup-caches [14:11:26] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [14:11:29] marostegui, kormat: just to verify, did you already downtime the read only alerts? [14:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:35] I think that's one of the bits Krinkle just added, I don't have a good sense of how fast it "should" be but 3 sec is surprising [14:11:48] legoktm: i have not. marostegui: ? [14:12:26] legoktm: I haven't [14:12:32] legoktm: I didn't know we had to :) [14:12:36] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [14:12:40] could one of you please :) [14:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:49] it'll just page everyone otherwise [14:12:55] rzl: I have warmed up the databases too, so maybe that'll help too [14:13:00] * legoktm makes a note to add to docs [14:13:13] _joe_: yeah that's where we stay I guess -- not sure why there's such a wide range between min and max? [14:13:13] kormat: can you downtime the masters? [14:13:19] rzl: 30 or 3s? 
(load.php is not a new entry) [14:13:23] timing: min = 0.1511215s | max = 3.480866712s | avg = 1.8757435517968757s | total = 6s [14:13:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 44 hosts with reason: DC switchover [14:13:29] marostegui: done. [14:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:29] Krinkle: three [14:13:35] kormat: <3 [14:13:42] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 44 hosts with reason: DC switchover [14:13:42] <_joe_> 3s but now I tried the query to one of the hosts and it's comparable to eqiad [14:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:50] thanks kormat [14:14:01] <_joe_> we are go for me [14:14:10] yeah I'm happy too [14:14:29] <_joe_> time curl -H 'Host: en.wikipedia.org' 'https://mw2374.codfw.wmnet/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector' -k is 0.55 secs from puppetmaster1001 [14:14:36] <_joe_> vs 0.450 to mw1331 [14:14:41] <_joe_> so I think it's ok [14:14:54] ok. stopping maintenance jobs [14:14:58] <_joe_> jayme: go [14:15:00] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [14:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:22] <_joe_> timers have been all stopped [14:15:32] verified all non-php-fpm processes are gone from mwmaint1002 [14:15:34] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [14:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:47] "stray php processes, please check" on both maintenance hosts [14:15:52] and my manual php scripts as well (as intended) [14:16:13] <_joe_> I disagree mwmaint1002:~$ ps -ef | grep php gives only php-fpm [14:16:23] <_joe_> so the maintenance is effectively stopped [14:16:35] same with codfw [14:16:41] <_joe_> mwmaint1002:~$ sudo crontab -u www-data -l [14:16:43] <_joe_> no crontab for www-data [14:16:47] yeah, let's figure out why that check is lying :) [14:16:48] <_joe_> we are go IMHO [14:16:52] but I think we're all set [14:17:00] <_joe_> rzl: i think it doesn't exclude php-fpm [14:17:12] it does "! pgrep -c php" [14:17:12] <_joe_> yeah pgrep php doesn't [14:17:13] (sorry, I mean, let's figure out later, no reason to hold the switch for it) [14:17:39] to verify, for 02-07 it's go as fast as possible unless stuff breaks? [14:17:39] <_joe_> yeah we're a go as far as I'm concerned [14:17:48] as fast as sensible :) [14:17:50] I'm no longer seeing a maintenance banner on-wiki btw [14:17:57] but I think we go ahead anyway [14:18:07] rzl: same [14:18:08] legoktm: ^ to note for followup with commrel [14:18:19] rzl: do you want me to enable the banner again? [14:18:25] huh indeed, the campaign stopped already, that's odd [14:18:27] up to legoktm [14:18:43] hey SGrabarczuk [14:18:46] yay, I've made it :facepalm: [14:18:48] I did confirm with Szymon Grabarczuk that the switchover is happening (via email) [14:18:49] oh good :) [14:19:03] last time I was setting up a cloak was a decade ago [14:19:04] SGrabarczuk: we were just noting it looks like the maintenance banner is gone, is that intended? 
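Back to the stray-php-processes warning above: pgrep matches the pattern against process names as an unanchored regex, so plain `pgrep php` also counts every php-fpm worker. A hedged sketch of a check that ignores them (exact process names on the maintenance hosts may differ):

  # List php processes with their full command line, drop php-fpm
  # workers, and count what is left; 0 means no stray maintenance scripts.
  pgrep -a php | grep -v php-fpm | wc -l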
[14:19:11] yup [14:19:28] read-only time hasn't started yet, but will soon start [14:19:53] note that maint jobs are already down, so we are not without impact atm [14:20:02] <_joe_> yeah I'd like to move on [14:20:04] yeah agree [14:20:21] jayme: please proceed [14:20:22] <_joe_> we can live without maintenance jobs runnign for a while though [14:20:28] Okay then. I'll go 02-07 without pausing unless someone stops me by highlight or a cookbook fails [14:20:37] +1 [14:20:42] terminal size please [14:20:56] <_joe_> sorry my bad [14:20:58] hey whoever just joined with tiny terminal can [14:21:02] yes please make bigger ty [14:21:25] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [14:21:26] !log jayme@cumin1001 MediaWiki read-only period starts at: 2021-06-29 14:21:26.671853 [14:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:45] <_joe_> I see no new edits [14:21:46] edits have stopped on L2W [14:21:47] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [14:21:49] hatnote went silent [14:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:52] hatnote went quet [14:21:52] the forest has becometh silent [14:21:52] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [14:21:54] <_joe_> go go go [14:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:25] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [14:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:30] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [14:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:53] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [14:22:55] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions [14:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:58] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) [14:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:05] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [14:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:08] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [14:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:10] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:23:13] "failed to check" is OK btw, I thought I'd fixed that one [14:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:17] just means it's hasn't updated yet [14:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:23] !log jayme@cumin1001 MediaWiki read-only period ends at: 2021-06-29 14:23:23.504447 [14:23:23] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [14:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:29] \o/ [14:23:31] <_joe_> edits starting [14:23:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:32] hatnotes! [14:23:33] nice! [14:23:35] \*/ [14:23:37] <_joe_> let's not be too happy though [14:23:38] hatnote resumed signing [14:23:43] <_joe_> let's see hwo we handle the load [14:23:47] checking dashboards yeah [14:23:51] edits taking ages for now [14:23:55] But I can edit [14:23:59] <_joe_> requests moved [14:24:19] Tad slow but yeah we're read write and the personal links says codfw [14:24:21] <_joe_> latencies are horrible in codfw right now [14:24:31] <_joe_> can someone look at api latencies [14:24:55] I really, really love that hatnote has become a valuable dc-switch monitoring tool :) [14:25:00] they also look really high on the read dashboard, for API [14:25:02] _joe_: looking [14:25:06] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:25:18] <_joe_> latencies are going down right now [14:25:26] databases are looking okish for now [14:25:33] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 659 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:25:38] yeah it seems a temporary spike (codfw apis) [14:25:39] api latencies peaked at double-digit seconds but droping [14:25:41] dropping [14:25:41] they are horrible, memcached traffic is increasing [14:25:47] let's give it a minute [14:25:55] <_joe_> let's go with the restart envoy step please [14:25:59] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [14:26:08] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at codfw #page on alert1001 is CRITICAL: 0.2115 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver [14:26:08] jayme: ^ [14:26:11] I am checking an error related to: Unknown database 'trwikivoyage' [14:26:11] <_joe_> it's pretty urgent [14:26:18] <_joe_> oh wait [14:26:18] for parsoid, latencies, initial spike but also going down [14:26:26] <_joe_> this page is expdedcted [14:26:26] waiting [14:26:30] <_joe_> mutante: thanks for checking [14:26:54] Latency seems fine now just browsing meta [14:27:01] <_joe_> the page is going to resolve soon [14:27:09] indeed, https://tr.wikivoyage.org/ is down [14:27:16] <_joe_> wtf [14:27:20] legoktm: I think we just need a bit more coverage in the api cache warmup [14:27:22] marostegui: i know what's with trwikivoyage [14:27:23] is it a new wiki not in codfw dbs? 
[14:27:24] We have a problem with trwikivoyage [14:27:28] it's missing from db-codfw.php [14:27:32] trwikivoyage was created in Jan 2021 [14:27:36] created in January https://github.com/wikimedia/operations-mediawiki-config/commit/d9d64bb041b43248402c23c9f7ae8ed4e787fbba [14:27:36] my mistake from wiki creation, probably :/ [14:27:38] urbanecm: looks like it is being searched for in s3 [14:27:41] but it is on s5 [14:27:41] <_joe_> ook, can someone add it? [14:27:47] _joe_: uploading a patch [14:27:49] <_joe_> while we proceed? [14:27:49] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POS [14:27:51] urbanecm: thanks [14:28:01] just to ask the question: do we stay in codfw or is this worth rolling back for? probably stay [14:28:05] <_joe_> I think we can proceed anyways, right? [14:28:06] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at codfw #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.7285 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver [14:28:12] _joe_: we can, it is just that one [14:28:19] !log TODO: Don't duplicate `sectionsByDB` between db-* files [14:28:21] should be an easy fix (a line on db-codfw.php I guess) [14:28:21] <_joe_> rzl: I'd vote we stay if the db exists [14:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:24] rzl: it seems like a MW config error, not an actual missing DB [14:28:31] okay, stay sgtm [14:28:34] +1 stay [14:28:42] <_joe_> yeah let's proceed with 08 please [14:28:42] the database does exist [14:28:44] who's patching? 
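A quick hedged check for this class of mismatch, run from a mediawiki-config checkout (file layout assumed from the paths mentioned in this log), is to confirm every wiki in a shard dblist also appears in both per-DC database configs:

  # Any wiki printed here is listed in dblists/s5.dblist but missing
  # (or misspelled) in db-eqiad.php or db-codfw.php.
  for wiki in $(grep -v '^#' dblists/s5.dblist); do
    grep -q "'$wiki'" wmf-config/db-eqiad.php \
      && grep -q "'$wiki'" wmf-config/db-codfw.php \
      || echo "missing: $wiki"
  done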
[14:28:47] (03PS1) 10Urbanecm: db-codfw: Fix trwikivoyage entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702136 [14:28:47] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 0.5469 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:28:51] <_joe_> Krinkle: urbanecm is [14:29:00] (03CR) 10Krinkle: [C: 03+1] db-codfw: Fix trwikivoyage entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702136 (owner: 10Urbanecm) [14:29:03] (03CR) 10Marostegui: [C: 03+1] db-codfw: Fix trwikivoyage entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702136 (owner: 10Urbanecm) [14:29:05] patch uploaded, needs just a sync in theory [14:29:06] (03CR) 10Legoktm: [V: 03+2 C: 03+2] db-codfw: Fix trwikivoyage entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702136 (owner: 10Urbanecm) [14:29:18] okay, as this is unrelated to the upcomin steps AIUI, I'll proceed legoktm rzl _joe_ [14:29:23] jayme: +1 [14:29:26] I'll sync this out [14:29:35] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners [14:29:37] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=0) [14:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:41] urbanecm: s5.dblist is fine, just checked [14:29:43] jayme: wait a bit before restore-ttl [14:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:45] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [14:29:50] rzl: ack [14:29:58] marostegui: it was a typo in the db name in codfw config files :/ [14:30:08] it happens [14:30:10] marostegui: I think the db-*.php files is only thing that matters for connections [14:30:14] jayme: that's our "we're confident we don't need to switch back quickly" step, let's make sure this db fix works as expected [14:30:16] urbanecm: yep :) [14:30:30] rzl: sure, got you [14:30:32] no reason to think it shouldn't, but no reason to close off our options either [14:30:39] <_joe_> +1 [14:30:53] oh, do start maintenance though IMO [14:30:56] <_joe_> is s1 in codfw suffering? [14:30:57] _joe_: ^? [14:30:57] !log legoktm@deploy1002 Synchronized wmf-config/db-codfw.php: fix trwikivoyage (duration: 01m 01s) [14:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:04] trwikivoyage back up [14:31:07] <_joe_> rzl: I'd wait tbh [14:31:11] kk [14:31:12] there's a bunch of near-saturated codfw api servers [14:31:12] <_joe_> ack [14:31:15] <_joe_> ok let's go [14:31:19] compare https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1&var-datasource=codfw%20prometheus%2Fops&from=now-1h&to=now [14:31:20] vs [14:31:22] https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now [14:31:49] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [14:31:49] _joe_: it is s8 [14:31:51] obvious question, is that saturation from cold caches or something else? [14:31:54] <_joe_> cdanis: ddo you know which servers? [14:32:09] RECOVERY - Check systemd state on mw1384 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:13] <_joe_> rzl: yes, we have so many layers of caching that it's normal to have a transient phase [14:32:17] _joe_: I am checking individual hosts to see what we can do [14:32:42] <_joe_> marostegui: do we need to wait before proceeding further? [14:32:46] latency for API servers is down to 300ms and falling, but was 170 in eqiad [14:32:58] _joe_: give me a minute [14:33:16] if there are notes for what to improve for next time, T260297 would prevent the tr.wikivoyage issue [14:33:17] T260297: Ensure dblist shard files match db-*.php definitions - https://phabricator.wikimedia.org/T260297 [14:33:18] mw2383 is the most saturated, 5x more so than the second-most [14:33:19] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:33:25] _joe_: https://w.wiki/3ZS9 [14:33:30] https://grafana.wikimedia.org/goto/4BzDXhk7k [14:33:37] <_joe_> rzl: can you check the weights of those servers? [14:33:42] looking [14:33:43] <_joe_> in lvs [14:34:00] they all have 30 here https://config-master.wikimedia.org/pybal/codfw/apaches [14:34:09] beat me to it, thanks [14:34:17] <_joe_> mutante: apis [14:34:21] <_joe_> not apaches [14:34:28] mw2362 mw2284 mw2261 mw2294 mw2290 mw2286 mw2262 are the worst off [14:34:29] <_joe_> also look at the https version [14:34:32] same, also all 30 https://config-master.wikimedia.org/pybal/codfw/api [14:34:52] _joe_: things look fine now in s8 [14:35:01] <_joe_> mutante: https [14:35:07] <_joe_> also why all 30 [14:35:19] (argh, my grafana link was appservers not api_appservers too) [14:35:33] api-https is also all 30 [14:35:40] <_joe_> yeah not great [14:35:45] correct answer, most saturated are mw2286, 2294, 2261 [14:35:46] <_joe_> I did say we needed tocheck [14:35:53] what's it supposed to be? [14:36:04] <_joe_> so I would suggest to lower slightly the weight of the older, less performant servers [14:36:14] codfw api_appserver saturations https://grafana.wikimedia.org/goto/WzRhu2z7k [14:36:15] urbanecm: confirmed, db error gone [14:36:20] <_joe_> !log depooling mw2383 [14:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:38] <_joe_> ok, let's proceed please, this can be fixed while the switchover completes [14:36:49] +1 for finishing phase 8 [14:37:12] jayme: ^^ [14:37:12] <_joe_> !log repooling mw2383 [14:37:15] mw2383 was procured in Feb 2021 [14:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:24] ack [14:37:31] looks lik ein eqiad we had 25-vs-30, not sure if it's the same generational diff? 
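The weight changes themselves go through conftool; a sketch of what that looks like from a cumin host, with a selector matching the older hosts discussed above (flags are from memory, worth double-checking against `confctl --help`):

  # Show current pooled state and weight for one of the older api servers.
  sudo confctl select 'name=mw2251.codfw.wmnet,service=api_appserver' get

  # Lower the weight for the older batch, then verify via config-master.
  sudo confctl select 'name=mw225[1-8].codfw.wmnet,service=api_appserver' set/weight=25
  curl -s https://config-master.wikimedia.org/pybal/codfw/api-https | grep mw2251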
[14:37:40] <_joe_> we might need to do a roll restart of the stuck apis [14:37:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce db2103 (s1) weight a bit', diff saved to https://phabricator.wikimedia.org/P16739 and previous config saved to /var/cache/conftool/dbconfig/20210629-143742-marostegui.json [14:37:45] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [14:37:45] <_joe_> let me confirm [14:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:03] _joe_: weights first and then roll-restart, right? [14:38:06] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) [14:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:11] <_joe_> !log restarting pohp-fpm on mw2383 [14:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:17] <_joe_> rzl: not sure [14:38:39] jayme: go ahead [14:38:48] <_joe_> ok a restart still didn't solve the issue on mw2383 it seems [14:38:57] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters [14:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:25] kormat: we'll need to do this manually, can you take care of it? https://phabricator.wikimedia.org/T266723 [14:39:43] kormat: it can be done after we are fully done with the switchover, no rush [14:39:49] marostegui: ack [14:39:53] <_joe_> ok so [14:40:25] (03PS7) 10Vgutierrez: varnish: Add listen on UDS support [puppet] - 10https://gerrit.wikimedia.org/r/701056 (https://phabricator.wikimedia.org/T285374) [14:40:31] <_joe_> mw2383 is a mystery tbh [14:41:00] most of this is brand new hardware too [14:41:04] seems like the model diff cutoff is mw2291 and higher are newer models than mw2290 and below [14:41:12] _joe_: just to be 100%, 2383 was a misdirect on my part -- it's an appserver, not an api [14:41:19] it's still a mystery, just a different one [14:41:29] (for the weights) [14:41:38] mw2251 through mw2258 are oldest [14:41:48] and shouldnt have 30 [14:42:03] <_joe_> ok I would go with 25 for all servers mw2290 and below in api [14:42:06] then mw2291 as bblack said [14:42:11] api_appserver saturations look like they're converging over the last few minutes [14:42:12] There were ~ 100 reqs in eqiad fatalling with "DBQueryError: MariaDB server is running with the --read-only option so it cannot execute this" which suggests either the code ignored MW readonly state, or maybe the db side changed before MW, or they were just very slow requests. Probably the latter, but recording it here just in case. [14:42:17] ack, should i do the weight fix? [14:42:21] did somebody change weights already or are they recovering? [14:42:38] Krinkle: we often see that when we do db switchovers [14:42:41] <_joe_> mutante: go ahead, lmk if you want to confirm the conftool query [14:42:58] things are recovering [14:43:11] TODO: consider adding relative time to logstash (e.g. 
"delta: 1s into the request") [14:43:11] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters (exit_code=0) [14:43:14] right now it is only five servers that look saturated on workers [14:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:23] <_joe_> yes [14:43:29] mw2405 mw2285 mw2286 mw2284 mw2261 [14:43:32] <_joe_> the mystery of mw2383 is unsolved [14:43:50] <_joe_> mw2286 isn't looking terrible either [14:43:56] marostegui, kormat: FYI 08-run-puppet-on-db-masters is done, you should be good to un-downtime or let it expire at your preference [14:44:12] rzl: I am fine with leaving it expire [14:44:13] <_joe_> and I think it's just matter of rebalancing the weights at this point [14:44:13] rzl: there's no safe way to un-downtime so v0v [14:44:19] yeah most of the slow ones are the ones that should be 25. mw2405 and mw2383 are not that, though. [14:44:20] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw225[1-8].codfw.wmnet,service=api_appserver [14:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:38] <_joe_> jayme: please start maintenance [14:44:40] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [14:44:44] apart from mw2383.codfw.wmnet, all other hosts have a sensible number of idle workers [14:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:51] jayme: same for the tendril-tree step [14:44:56] ack [14:44:58] my home internet has picked a fun time to be flakey [14:45:01] <_joe_> mutante: please also do all the other servers below 2290 in api I would say [14:45:10] and in appservers? [14:45:20] ACK, do I need to set canary service back to 1 afterwards [14:45:22] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw225[1-8].codfw.wmnet [14:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:28] the API servers are little behind though [14:45:32] app servers look ok [14:46:06] load.php latency has recovered back to 95% of where it was an hour ago. the impact was comparable to a train deployment. mostly warm as it should be. 
[14:46:15] eqiad had a 25-vs-30 weight diff in appservers as well [14:46:16] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw22[7-8][0-9].codfw.wmnet [14:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:24] <_joe_> bblack: appservers get little enough traffic it's not that relevant [14:46:41] bblack: yea, but they are not exactly the same hardware generations matching perfectly [14:46:49] agree we should do it eventually just for consistency, but also agree it's not urgent [14:46:51] s/95%/105%/ [14:46:52] _joe_ jayme I am going to merge https://gerrit.wikimedia.org/r/c/operations/dns/+/702128/ [14:46:54] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [14:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:59] mutante: also mw226[12] + mw2290 for api [14:47:00] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30028/console" [puppet] - 10https://gerrit.wikimedia.org/r/701056 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [14:47:02] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw2290.codfw.wmnet [14:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:30] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw226[1-2].codfw.wmnet [14:47:32] !log jayme@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-update-tendril [14:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:34] bblack: ACK done [14:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:42] <_joe_> marostegui: go [14:47:46] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) [14:47:49] (03CR) 10Marostegui: [C: 03+2] wmnet: Change masters cnames [dns] - 10https://gerrit.wikimedia.org/r/702128 (https://phabricator.wikimedia.org/T281515) (owner: 10Marostegui) [14:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:04] done [14:48:05] weight looks as-intended here: https://config-master.wikimedia.org/pybal/codfw/api-https [14:48:12] cookbooks done [14:48:21] Wikidata change dispatching seems to have resumed, yay [14:48:26] but yeah 25 is a guess based on eqiad having that number, as noted there's different hardware involved here [14:48:42] jayme: tendril looking good, apart from pc/x2 hosts that need to be done manually (cc kormat ) [14:48:59] avg response for GET both in api and app are within acceptable limits [14:49:03] the best estimate is to go by batches in netbox, searchbing by procurement date, bblack [14:49:21] <_joe_> bblack: now it's matter of someone looking at the cpu usage on the various api servers, trying a restart, then adjusting weights if it can't be fixed [14:49:31] yeah [14:49:32] marostegui: do you do whatever needs to be done there? :) [14:49:34] <_joe_> effie: I'd depool mw2383, do you agree? [14:49:40] jayme: yeah, we'll take care of that [14:49:44] my rough quick cutoff was noting the different in /proc/cpuinfo :) [14:49:46] jayme: yep, on it. [14:49:47] I am setting the canary services back to weight 1 [14:49:48] _joe_: one minute to check one more thing [14:49:53] and depool it [14:49:53] depooling 2383 for study sgtm [14:49:58] marostegui: kormat: nice, thanks! 
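The hardware-generation cutoff can be cross-checked fleet-wide with cumin; the host expression below is just an example glob rather than a maintained alias:

  # Group the codfw mw hosts by CPU model; cumin aggregates hosts that
  # return identical output, which makes the generation boundary obvious.
  sudo cumin 'mw22*.codfw.wmnet' "grep -m1 'model name' /proc/cpuinfo"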
[14:50:00] edits are at about 60% of where they were an hour ago (600 vs 1000 /min) [14:50:14] marostegui: i think i should probably do x2, too [14:50:15] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:50:21] kormat: +1 [14:50:27] Krinkle: I think it's because of Wikidata dispatch lag stopping bots [14:50:33] yup [14:50:36] (03CR) 10Elukey: "A lot of people were happy about having info related to their session (kinit vs active session etc..), if we remove this we should think a" [puppet] - 10https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [14:50:42] https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&refresh=5s&from=now-1h&to=now shows just the Wikidata edit rate [14:50:43] should recover itself shortly now that dispatching has started again [14:50:45] cpu model 2290/2291 cutoff from cumin for reference: https://phabricator.wikimedia.org/P16740 [14:50:56] legoktm: also https://grafana.wikimedia.org/d/000000170/wikidata-edits?editPanel=2&orgId=1&refresh=1m&from=now-1h&to=now (has a panel for maxlag too) [14:51:35] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:51:40] ty [14:51:49] (03PS3) 10Muehlenhoff: Convert sretest-logout.py to wmflib.idm [puppet] - 10https://gerrit.wikimedia.org/r/702133 [14:51:55] legoktm: https://grafana-rw.wikimedia.org/d/000000429/backend-save-timing-breakdown?viewPanel=29&orgId=1&from=now-3h&to=now [14:52:00] _joe_ rzl I am depooling mw2383 unless you already have [14:52:04] ack, non-bots are same level again [14:52:09] effie: go ahead [14:52:23] !log depool mw2383 as it is misbehaving [14:52:26] !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw225[1-2].codfw.wmnet,service=canary [14:52:27] Wikidata dispatching has finished catching up, bots should start editing again [14:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:35] <_joe_> effie: go :) [14:52:53] !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw227[8-9].codfw.wmnet,service=canary [14:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:17] yup, wikidata-edits dashboard looks like it’s back to normal [14:53:22] marostegui: tendril update complete, including x2. [14:53:25] kormat: parsercache purge won't start until ~9h from now. shall I start a tagged manual run on pc1/2/3 from codfw now? [14:53:32] kormat: thanks :** [14:53:37] !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw227[1-2].codfw.wmnet,service=canary [14:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:49] SGrabarczuk: there may be some lingering slowness but I think most things should be back to normal now [14:53:55] Krinkle: i'd wait for the dust to settle first re: switchover [14:54:01] Krinkle: let's give time for things to calm down [14:54:13] thx legoktm [14:55:17] bblack: this confirms the hardware gens. 
https://phabricator.wikimedia.org/T247021#5947061 [14:55:22] 10SRE, 10SRE-OnFire, 10observability: SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10CDanis) [14:55:25] <_joe_> legoktm: I see no sign of slowness right now [14:55:41] urbanecm: btw, nice job with that trwikivoyage issue, appreciate the quick fix [14:55:57] happy i could help :) [14:55:59] indeed! thanks urbanecm [14:56:11] <_joe_> apis still have a slightly high 99th percentile [14:56:18] <_joe_> but well within reason AFAICT [14:56:23] <_joe_> is icinga happy? [14:56:55] icinga has decided it has 622 pending checks to run against db hosts [14:57:03] 620 PEnding [14:57:09] (03CR) 10BryanDavis: [C: 03+1] tools-clush: remove paws from clush and add the rest of the k8s setup [puppet] - 10https://gerrit.wikimedia.org/r/701975 (https://phabricator.wikimedia.org/T280299) (owner: 10Bstorm) [14:57:09] it still has the api_appserver alerts [14:57:12] _joe_: pretty much all the stuff showing up there, is older than the dc switch [14:57:23] mw2380 has SMART error [14:57:29] but that's it [14:57:44] mutante: but that's from 2 days ago [14:57:47] sorry, where do you see how many pending checks? [14:57:49] yea, that broke 2 days ago [14:58:02] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=1 [14:58:20] legoktm: it's a bit unfortunate because it's a frameset and you might just get a part of it [14:58:23] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:43] the blue icon on top of https://icinga.wikimedia.org/icinga/ [14:58:54] but depending from where we link it might not show the menu [14:59:16] is this an affect of the switchover? or just icinga things? [14:59:24] just icinga things [14:59:33] and how we link to it from alerts.wm.org or so [14:59:37] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:59:48] _joe_: did you restart php-fpm on 2383? [14:59:54] oh, the part that the checks are PENDING is related to the sswitch, yes [15:00:16] because puppet logic tells it to create checks in non-active DC or so [15:00:22] _joe_: nvm [15:00:43] gotcha [15:01:33] legoktm: optionally we can mark them all and click "reschedule next service check" for "now" [15:01:44] but it's a bit mean to icinga load [15:02:11] I'd rather just leave it alone if it'll catch up naturally in a reasonable amount of time [15:02:29] *nod* [15:02:36] <_joe_> read-only time was 1m57s [15:02:55] <_joe_> and we had a much shorter high-latnecy period compared to last time [15:02:55] legoktm: reasonable time ~ 17 min [15:02:59] <_joe_> and basically no issues [15:02:59] new record? 
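If the ~620 pending Icinga checks ever did need a nudge, Icinga 1.x accepts external commands over its command pipe; a hedged sketch, with the command-file path and the host/service names as placeholders:

  # Force an immediate recheck of one service instead of waiting for the
  # scheduler to get around to it.
  now=$(date +%s)
  printf '[%s] SCHEDULE_FORCED_SVC_CHECK;db2071;MariaDB read only;%s\n' "$now" "$now" \
    | sudo tee -a /var/lib/icinga/rw/icinga.cmd >/dev/null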
[15:03:22] <_joe_> wkandek: I don't think it is, but we're now at the point where it's pointless to count [15:03:31] <_joe_> we didn't automate more stuff this time [15:03:37] <_joe_> as we're out of things to automate :P [15:03:42] yeah, it's about the same range as last time and we decided to retire the contest I think [15:03:59] <_joe_> it makes sense to measure if we're changing the procedure [15:04:01] I’m still getting “search is too busy” errors in the Android app fairly frequently (maybe ~25% of searches) [15:04:06] <_joe_> else we're just being reckless [15:04:10] ebernhardson: ^^ [15:04:19] <_joe_> also gehel ^^ [15:04:33] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=DELETE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:04:35] Lucas_WMDE: I'm looking into it [15:04:37] ok [15:04:53] I think we *could* automate the "run steps two through seven without stopping unless there's a problem" and probably cut 15-20 seconds off, but it doesn't seem urgent [15:05:10] 2380 is pooled with degraded RAID, should we depool that too? [15:05:33] (03CR) 10Bstorm: [C: 03+2] tools-clush: remove paws from clush and add the rest of the k8s setup [puppet] - 10https://gerrit.wikimedia.org/r/701975 (https://phabricator.wikimedia.org/T280299) (owner: 10Bstorm) [15:05:47] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:06:18] !log banning elastic2054 [15:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:27] <_3_> mutante: set it to inactive though if it's broken [15:06:54] _joe_: it's still working just not redundant disks [15:07:11] will have to be depooled once dcops replace disk (only) [15:08:08] (03PS1) 10Arturo Borrero Gonzalez: toolforge: enable jobs-api [puppet] - 10https://gerrit.wikimedia.org/r/702139 (https://phabricator.wikimedia.org/T283238) [15:08:48] if we're in good shape I'll head offline again [15:08:59] see some of you most week, the rest the week after <3 great job everyone [15:09:09] *some of you next week, that is [15:09:16] rzl: thanks!! 
o/ [15:09:17] thanks rzl o/ [15:09:18] rzl: good luck hunting [15:11:20] (03PS2) 10Arturo Borrero Gonzalez: toolforge: enable jobs-api [puppet] - 10https://gerrit.wikimedia.org/r/702139 (https://phabricator.wikimedia.org/T283238) [15:11:22] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: drop ingress configuration for jobs-api [puppet] - 10https://gerrit.wikimedia.org/r/702141 [15:11:31] ACKNOWLEDGEMENT - Device not healthy -SMART- on mw2380 is CRITICAL: cluster=jobrunner device=sda instance=mw2380 job=node site=codfw daniel_zahn https://phabricator.wikimedia.org/T285603 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2380&var-datasource=codfw+prometheus/ops [15:12:16] mw-log-cleanup.service on mwlog2002 is failed but that is also since 13h [15:12:27] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.0625 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:13:38] just one issue: [15:13:40] CRITICAL: Wikidata Alerts ( https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts ) is alerting: Wikidatawiki Dispatch Problem. [15:13:45] that is since 42 min [15:14:03] but on alert1001, not 2001 [15:14:04] Lucas_WMDE can confirm, but I think the dispatch is working correctly, but maybe the alert is off? [15:14:21] seems like the check needs some logic to failover to codfw, maybe [15:14:48] pretty sure dispatching is working, but I can test it, one moment [15:15:07] [mwlog2002:~] $ sudo systemctl status mw-log-cleanup [15:15:15] Active: failed (Result: exit-code) since Tue 2021-06-29 02:00:21 UTC; 13h ago [15:15:18] starting that [15:15:46] !log [mwlog2002:~] $ sudo systemctl start mw-log-cleanup [15:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:53] mutante, legoktm: can confirm, change dispatching is working in practice [15:15:58] I’m afraid I don’t know much about that alert [15:16:45] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:06] ^ recovery was starting that service [15:17:20] recovery from the Wikidata dispatch alert? [15:17:22] Lucas_WMDE: ACK, thanks for checking. might need a small follow-up for monitoring itself [15:17:34] Lucas_WMDE: no, unrelated issue on mwlog host [15:17:37] ok [15:17:46] !log pool mw2383 back [15:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:56] I'm trying to figure out why the dispatch monitoring is off [15:18:27] search is recovering, looks like a misbehaving node that will need to be investigated [15:18:34] <_joe_> legoktm: I guess it's checking something in eqiad [15:18:50] <_joe_> dcausse: thanks so it was mostly unrelated to the switchover? [15:19:01] reduced unhandled Icinga CRITs to 4, one is a logstash.mgmt, one is the netbox cable report, one is db1124 but has disabled notificatins, last one is the wikidata dispatch thing.. ALL CLEAR on Icinga [15:19:17] mutante: I will take care of db1124 (it is a test host anyways) [15:19:23] marostegui: thanks! 
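For failed units like mw-log-cleanup.service, the quickest triage before restarting anything is usually:

  # List whatever systemd considers failed, then pull the last log lines
  # for the unit in question to see why it exited non-zero.
  systemctl --failed
  sudo journalctl -u mw-log-cleanup.service -n 50 --no-pager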
[15:19:52] _joe_: yes I believe so [15:20:06] ACKNOWLEDGEMENT - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:20:17] <_joe_> dcausse: ack thanks that's important for us [15:20:23] https://gerrit.wikimedia.org/g/operations/puppet/+/c66ae7f1b247f6ba201c1acc5d24de4493dfad50/modules/icinga/manifests/monitor/wikidata.pp#43 [15:20:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: drop ingress configuration for jobs-api [puppet] - 10https://gerrit.wikimedia.org/r/702141 (owner: 10Arturo Borrero Gonzalez) [15:20:48] <_joe_> legoktm: I guess it's looking at the wrong dashboard? [15:20:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: enable jobs-api [puppet] - 10https://gerrit.wikimedia.org/r/702139 (https://phabricator.wikimedia.org/T283238) (owner: 10Arturo Borrero Gonzalez) [15:21:23] no it's the right one...or at least the dashboard it's pointing to has no alerts on it [15:21:29] <_joe_> legoktm: oh no I think it';;s alerting because of https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?viewPanel=12&orgId=1&refresh=5s&from=now-1h&to=now [15:21:40] <_joe_> it was alerting in the last hour and that check is slow to react [15:21:50] hm, could be [15:22:03] or possibly https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?viewPanel=10&orgId=1&refresh=5s (s8 rows read)? [15:22:03] <_joe_> oh no wait [15:22:10] I assumed that as long as https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&refresh=5s "Alert List (not green)" was empty, it would be happy [15:22:12] <_joe_> we just don't have the data for the wikibase lag [15:22:14] <_joe_> wait [15:22:16] It's just gone ok according to wikidata-feed [15:23:20] ACK, it recovered. all green. also icinga warning level, nothing is newer than 6h [15:23:21] <_joe_> yes, https://grafana.wikimedia.org/goto/4hvMqhznk [15:23:45] <_joe_> so after the switchover, somehow [15:23:58] <_joe_> we get the dispatch lag for wikibase-queryservice [15:24:11] <_joe_> but not for wikibase-dispatching [15:24:18] <_joe_> Lucas_WMDE: any idea how that might happen? [15:24:42] _joe_: it's ran by cron according to the docs [15:24:47] <_joe_> did anyone check that dispatching *is* runing on mwmaint2002? [15:24:53] (03PS1) 10Arturo Borrero Gonzalez: toolforge: haproxy: more jobs_api cleanup [puppet] - 10https://gerrit.wikimedia.org/r/702142 [15:24:56] <_joe_> RhinosF1: I am aware, thanks [15:25:14] _joe_: yes, I checked [15:25:30] _joe_: no idea what’s going on in that chart [15:25:31] 16s ago mediawiki_job_wikidata-updateQueryServiceLag.timer [15:25:32] <_joe_> legoktm: is it still running via cron right? [15:25:36] yep [15:25:40] <_joe_> mutante: that's wdqs [15:25:43] ah, ok [15:25:43] 130522 130523 130522 130522 ? -1 S 33 0:00 \_ /bin/bash /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki [15:25:44] (03CR) 10Ottomata: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/696348 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [15:26:15] <_joe_> and yes it's running [15:26:21] <_joe_> ok not a serious issue then [15:26:31] <_joe_> I'd open a bug to wikibase maybe [15:26:42] new alert for citoid timeout in eqiad.. 
just eqiad [15:26:54] so the wikibase-dispatching lag continued being tracked just long enough that it recorded the drop back to normal (after dispatching resumed) [15:26:55] and ..it's gone again, heh [15:26:58] and *then* vanished into thin air? [15:27:00] <_joe_> mutante: I think we don't need to look further [15:27:13] <_joe_> Lucas_WMDE: apparenlty [15:27:18] _joe_: yep, done [15:27:22] could it be that that chart only ever shows the highest of the three lags? [15:27:30] because in that timeframe I don’t see db either even though it appears in the legend [15:27:46] <_joe_> Lucas_WMDE: ooh damn [15:27:50] <_joe_> yes [15:27:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: haproxy: more jobs_api cleanup [puppet] - 10https://gerrit.wikimedia.org/r/702142 (owner: 10Arturo Borrero Gonzalez) [15:28:10] or at least over the last 6 hours, I also see some lines starting and stopping seemingly at random [15:28:27] <_joe_> but no [15:28:39] <_joe_> it's not just the max [15:28:43] not quite [15:28:45] <_joe_> it's sorted by max lag [15:28:57] but it does feel like lines appear and disappear as I zoom in and out on the time… [15:29:21] <_joe_> yes they do [15:29:33] <_joe_> I have no idea how that thing is calculated though [15:30:06] I added it to my notes, it might just be that we need better docs around this [15:30:53] !log restarting blazegraph on wdqs1012 [15:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:15] _joe_: replacing dispatching with jobs is the highest priority in tech maintenance so I think it'll be picked up and replaced altogether soon [15:32:23] if you ask me, no need to create ticket [15:33:44] PROBLEM - WDQS high update lag on wdqs1012 is CRITICAL: 4.453e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:35:22] dcausse: what's wrong with wdqs? [15:35:23] (03PS2) 10Jbond: postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) [15:35:40] we got alert mail that new ports are open on apt2001, ferm rule logic was tied to active DC [15:35:49] but apt.wikimedia.org is apt1001 as normal [15:35:49] legoktm: blazegraph sometimes deadlock and has to be restarted (known issue) [15:35:54] (03CR) 10jerkins-bot: [V: 04-1] postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [15:39:26] dcausse: ack, thanks for explaining :) [15:39:45] (03PS1) 10Ladsgroup: statistics: Drop rsync_job [puppet] - 10https://gerrit.wikimedia.org/r/702145 [15:40:12] (03CR) 10Ladsgroup: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/696348 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [15:40:32] (03Abandoned) 10Ladsgroup: statistics: Migrate cron to systemd timer in rsync_job [puppet] - 10https://gerrit.wikimedia.org/r/696348 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [15:42:07] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10Legoktm) The switchover is mostly complete now, we were read only from 2021-06-29 14:21:26.671853 to 2021-06-29 14:23:23.504447, or 1m57s. The raw notes... 
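(The two checks described in this stretch -- that the Wikibase dispatcher is running and that the query-service lag timer fires -- can be reproduced on the maintenance host roughly as follows. This is a sketch only; the process and timer names are the ones quoted in the log, assumed unchanged:)
    # is the Wikibase change dispatcher actually running?
    pgrep -af 'dispatchChanges.php --wiki wikidatawiki'
    # when did the query-service lag reporting timer last fire?
    systemctl list-timers 'mediawiki_job_wikidata-updateQueryServiceLag*' --no-pager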
[15:43:26] !log unbanning elastic2054 [15:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:08] (03PS3) 10Jbond: postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) [15:46:39] (03CR) 10jerkins-bot: [V: 04-1] postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [15:49:44] (03PS1) 10Ahmon Dancy: README.md: Trivial change to test deployment pipeline [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702147 [15:50:01] elukey: this still needed? apt1001 (1371 minutes ago). Puppet is disabled. elukey - testing [15:51:05] (03PS4) 10Jbond: postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) [15:51:29] mutante: not at all, I thought I had re-enabled it, my bad [15:51:34] (03CR) 10jerkins-bot: [V: 04-1] postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [15:51:36] the new deployment host is deployment.codfw.wmnet, right? I get a fingerprint from it that doesn’t match https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy2002.codfw.wmnet … [15:51:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30031/console" [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [15:52:15] Lucas_WMDE: deployment.codfw.wmnet.300INCNAMEdeploy1002.eqiad.wmnet. [15:52:25] elukey: thanks! re-enabling and running puppet.. I was just wondering why apt2001 changed some ferm config when we switched DCs, dont see the code doing that. [15:52:28] argh sorry horrible paste [15:52:34] it’s readable enough ^^ [15:52:51] is it correct? we should still deploy from deploy1002, not 2002? [15:53:06] (mainly asking as a backport+config deployer in this case) [15:53:24] legoktm: --^ (I am not sure, asking to an authoritative source :) [15:53:44] Lucas_WMDE: deployment server has not changed [15:53:46] Lucas_WMDE, elukey: as train deployer i was just wondering the same thing; my current understanding is that deploy server doesn't change, but i'm prepared to be wrong. 
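(The fingerprint confusion above stems from both deployment service aliases still pointing at deploy1002, as elukey's mangled paste shows. A hedged way to confirm it:)
    # both aliases are expected to return the same CNAME target
    dig +short CNAME deployment.eqiad.wmnet
    dig +short CNAME deployment.codfw.wmnet    # per the paste above, also deploy1002.eqiad.wmnet.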
[15:53:47] as long as deploy2002 has that "DO NOT USE THE HOST" motd banner (it does), the answer should be yes [15:53:56] ok thanks [15:54:16] ^ [15:54:18] so I might as well continue SSHing into deployment.*eqiad*.wmnet I assume [15:54:30] yep [15:55:14] (03PS1) 10Jbond: puppetdb: assign ro permissions to puppetdb_ro user [puppet] - 10https://gerrit.wikimedia.org/r/702150 (https://phabricator.wikimedia.org/T285666) [15:55:15] btw: it's /puppet$ grep deployment_server hieradata/common.yaml [15:55:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30032/console" [puppet] - 10https://gerrit.wikimedia.org/r/702150 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [15:56:06] legoktm: icinga pending is finally done [15:56:08] we need to schedule a time to switch the deployment server, but I was waiting for after today to do that :p [15:56:09] and it looks like the maintenance server is mwmaint2002, not …2001 [15:56:17] correct [15:56:17] I’ll have to see where on wikitech I found 2001 [15:57:02] [deploy1002:~] $ host mwmaint.discovery.wmnet [15:57:02] mwmaint.discovery.wmnet is an alias for mwmaint1002.eqiad.wmnet. [15:57:05] Lucas_WMDE: ^ [15:57:05] I suspect I inferred 2001 from the fact that https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints has no 2002 subpage yet [15:58:25] alright, https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers/Script updated [15:58:30] (03PS5) 10Jbond: postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) [15:58:44] mutante: you mean I should use 1002 not 2002? [15:59:00] (03PS2) 10Jbond: puppetdb: assign ro permissions to puppetdb_ro user [puppet] - 10https://gerrit.wikimedia.org/r/702150 (https://phabricator.wikimedia.org/T285666) [15:59:05] (03CR) 10jerkins-bot: [V: 04-1] postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [15:59:43] Lucas_WMDE: No, I wanted to say "there is an alias for that" so you can forget about numbers, just like for deployment, but ... that DNS name is not actually switched yet.. hmm [16:00:05] jbond42 and cdanis: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210629T1600). [16:00:06] ok, but the “discovery” in that alias makes me wonder if it’s “meant for me” ^^ [16:00:15] (03PS6) 10Jbond: postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) [16:00:21] legoktm: ^ should we change the discovery name for mwmaint.. guess so [16:00:31] (03PS3) 10Jbond: puppetdb: assign ro permissions to puppetdb_ro user [puppet] - 10https://gerrit.wikimedia.org/r/702150 (https://phabricator.wikimedia.org/T285666) [16:02:37] (03PS7) 10Jbond: postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) [16:02:45] (03PS4) 10Jbond: puppetdb: assign ro permissions to puppetdb_ro user [puppet] - 10https://gerrit.wikimedia.org/r/702150 (https://phabricator.wikimedia.org/T285666) [16:03:29] mutante: does this mean we can finally call the stretch to buster for mediawiki done? [16:03:53] Amir1: not right now but tomorrow or so.. 
I have to reimage the one that is not active now [16:04:03] cooool [16:04:04] Awesome [16:04:08] Amazing [16:04:10] mutante: ok, that was longer than I expected, I'll add a note, it's probably not great monitoring was lagged for so long [16:04:23] (03PS1) 10Brennen Bearnes: Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702166 [16:04:36] legoktm: it was "just" for DB hosts though, but yes [16:04:59] (03CR) 10Brennen Bearnes: [C: 03+2] Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702166 (owner: 10Brennen Bearnes) [16:05:08] i think we should change mwmaint.discovery.wmnet.. pulling DNS repo [16:05:45] (03CR) 10Brennen Bearnes: [C: 03+2] README.md: Trivial change to test deployment pipeline [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702147 (owner: 10Ahmon Dancy) [16:06:29] legoktm: Lucas_WMDE: ooh,, I remmeber now.. so it's complicated. there is an "active mwmaint server" for "where the crons run" and then there is another "active mwmaint server" for "where noc.wm.org runs" [16:06:34] (03CR) 10Filippo Giunchedi: [C: 03+2] mtail: use non-deprecated log.warning [puppet] - 10https://gerrit.wikimedia.org/r/702110 (https://phabricator.wikimedia.org/T285534) (owner: 10Filippo Giunchedi) [16:06:39] the discovery name is for the webserver part [16:07:02] which is https://phabricator.wikimedia.org/T265936 for making that active-active [16:07:20] (03CR) 10Filippo Giunchedi: [C: 03+2] mtail: parse 3.0.0~rc43 store format [puppet] - 10https://gerrit.wikimedia.org/r/702116 (https://phabricator.wikimedia.org/T285534) (owner: 10Filippo Giunchedi) [16:07:21] yea, exactly [16:07:34] (03PS3) 10Filippo Giunchedi: mtail: parse 3.0.0~rc43 store format [puppet] - 10https://gerrit.wikimedia.org/r/702116 (https://phabricator.wikimedia.org/T285534) [16:07:37] Lucas_WMDE: for today.. just use mwmaint2002 but in the long run we will make this better :p [16:07:42] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] mtail: parse 3.0.0~rc43 store format [puppet] - 10https://gerrit.wikimedia.org/r/702116 (https://phabricator.wikimedia.org/T285534) (owner: 10Filippo Giunchedi) [16:07:43] ok thanks ^^ [16:08:07] I updated my script and put it on wiki, idk if anyone else actually uses it :P [16:08:49] (03PS8) 10Jbond: postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) [16:09:02] cool! 
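(Two quick checks for the questions in this stretch of the conversation, using the file path and record name quoted above; a sketch, not authoritative:)
    # which host puppet currently treats as the deployment server
    grep 'deployment_server:' hieradata/common.yaml    # run from a checkout of operations/puppet
    # which backend the maintenance-host discovery record points to right now
    host mwmaint.discovery.wmnet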
[16:09:12] (03PS5) 10Jbond: puppetdb: assign ro permissions to puppetdb_ro user [puppet] - 10https://gerrit.wikimedia.org/r/702150 (https://phabricator.wikimedia.org/T285666) [16:09:58] wow, super nifty [16:10:16] (03Merged) 10jenkins-bot: Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702166 (owner: 10Brennen Bearnes) [16:10:52] Lucas_WMDE: note that there's nothing stopping you from sending mwdebug requests to eqiad servers, it'll just be a bit slower since all the services are in codfw [16:11:16] ok, good to know [16:11:41] legoktm: there is, it'll talk to (RO) eqiad databases [16:11:49] oh, duh [16:11:57] you need to use debug srvs from active DC to have writable MW [16:12:06] yep yep, nvm [16:12:18] (03PS1) 10Brennen Bearnes: Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702167 [16:12:41] okay, so edits won’t work [16:14:19] (03PS9) 10Jbond: postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) [16:14:39] (03PS6) 10Jbond: puppetdb: assign ro permissions to puppetdb_ro user [puppet] - 10https://gerrit.wikimedia.org/r/702150 (https://phabricator.wikimedia.org/T285666) [16:15:07] (03PS1) 10Ahmon Dancy: Trigger update-train-versions job at end of wmf-publish pipeline [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702168 [16:15:42] (03CR) 10Ahmon Dancy: [C: 03+2] Trigger update-train-versions job at end of wmf-publish pipeline [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702168 (owner: 10Ahmon Dancy) [16:17:28] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for btullis - https://phabricator.wikimedia.org/T285754 (10Ottomata) a:03razzi Ben's first day was yesterday, so let's expedite this! @razzi will take care of this, and I will follow up with SRE on enabling root a... [16:17:50] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for btullis - https://phabricator.wikimedia.org/T285754 (10Ottomata) Oh, I think `analytics-admins` is not needed since Ben will be an SRE and have root access, editing task description. [16:18:06] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for btullis - https://phabricator.wikimedia.org/T285754 (10Ottomata) [16:18:17] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM and very nice work Chris. 
I would defer to John or the others on whether there may be further code improvements possible, certainly " [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [16:19:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/702123 (https://phabricator.wikimedia.org/T285456) (owner: 10Dzahn) [16:21:00] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702126 (https://phabricator.wikimedia.org/T285456) (owner: 10Dzahn) [16:22:27] (03CR) 10Jbond: [C: 03+1] "not tested but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/702133 (owner: 10Muehlenhoff) [16:24:49] (03Merged) 10jenkins-bot: README.md: Trivial change to test deployment pipeline [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702147 (owner: 10Ahmon Dancy) [16:27:04] (03PS10) 10Jbond: postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) [16:27:32] (03PS7) 10Jbond: puppetdb: assign ro permissions to puppetdb_ro user [puppet] - 10https://gerrit.wikimedia.org/r/702150 (https://phabricator.wikimedia.org/T285666) [16:27:48] (03PS3) 10Bartosz Dziewoński: DiscussionTools: Enable new topic tool by default on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656572 (https://phabricator.wikimedia.org/T272077) (owner: 10Esanders) [16:28:09] (03CR) 10Brennen Bearnes: [C: 03+2] Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702167 (owner: 10Brennen Bearnes) [16:28:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30039/console" [puppet] - 10https://gerrit.wikimedia.org/r/702150 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [16:28:59] !log temporarily ban elastic2045 from production-search-codfw [16:29:03] (03PS4) 10Bartosz Dziewoński: DiscussionTools: Enable new topic tool by default on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656572 (https://phabricator.wikimedia.org/T272077) (owner: 10Esanders) [16:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:25] hello, is anyone around who could merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/656572 for me (a beta-cluster-only change), or should i schedule it properly? [16:30:32] (03CR) 10Jbond: [C: 03+2] postgress::user: Add new grant resource [puppet] - 10https://gerrit.wikimedia.org/r/702134 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [16:31:16] MatmaRex: patch looks good. 
Happy to merge it, provided that's okay by SREs (topic still says "switchover in process") [16:32:03] (03PS8) 10Jbond: puppetdb: assign ro permissions to puppetdb_ro user [puppet] - 10https://gerrit.wikimedia.org/r/702150 (https://phabricator.wikimedia.org/T285666) [16:32:26] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656572 (https://phabricator.wikimedia.org/T272077) (owner: 10Esanders) [16:35:01] (03CR) 10Jbond: [C: 03+2] puppetdb: assign ro permissions to puppetdb_ro user [puppet] - 10https://gerrit.wikimedia.org/r/702150 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [16:35:28] urbanecm: go for it [16:35:34] (03CR) 10Urbanecm: [C: 03+2] DiscussionTools: Enable new topic tool by default on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656572 (https://phabricator.wikimedia.org/T272077) (owner: 10Esanders) [16:35:35] p [16:35:38] thanks legoktm [16:36:21] thanks [16:36:40] (03Merged) 10jenkins-bot: DiscussionTools: Enable new topic tool by default on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656572 (https://phabricator.wikimedia.org/T272077) (owner: 10Esanders) [16:37:02] any time MatmaRex [16:37:10] (03Merged) 10jenkins-bot: Trigger update-train-versions job at end of wmf-publish pipeline [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702168 (owner: 10Ahmon Dancy) [16:37:13] (03Merged) 10jenkins-bot: Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702167 (owner: 10Brennen Bearnes) [16:37:50] now the good question is...which deployment srv do i use [16:38:07] urbanecm: same one, it didn't switch [16:38:15] good to know [16:38:24] I'll send an email since you're not the first to ask [16:38:36] hehe [16:38:46] that means i don't monitor the scrollback closely enough [16:38:55] good call, and maybe anything else that people use frequently that didn't switch [16:39:56] speaking of deployment servers, the motd should be fixed. It says ` Connect to 'deploy1002.eqiad.wmnet' instead, it will route you to the correct server.`, which is not entriely true. It will always route me to deploy1002 (or at least that's my understanding). [16:40:10] legoktm: appreciated, re: email. [16:40:15] urbanecm: which motd? [16:40:16] (03PS1) 10Jbond: postgreslq::db_grants: correct order of sql statment [puppet] - 10https://gerrit.wikimedia.org/r/702154 [16:40:29] legoktm: deploy2002. trying to locate it in puppet [16:40:32] https://wikitech.wikimedia.org/w/index.php?title=Deployment_server&type=revision&diff=1917183&oldid=1906582 [16:40:39] (03CR) 10Ottomata: [C: 03+1] "thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/702145 (owner: 10Ladsgroup) [16:40:56] (03CR) 10Jbond: [C: 03+2] postgreslq::db_grants: correct order of sql statment [puppet] - 10https://gerrit.wikimedia.org/r/702154 (owner: 10Jbond) [16:41:02] fyi https://wikitech.wikimedia.org/wiki/Switch_Datacenter/Not_switching [16:41:18] urbanecm: https://github.com/wikimedia/puppet/blob/f35c58a590bfbd0797444915b01888cbc0f83f4b/modules/profile/manifests/mediawiki/deployment/server.pp#L124 [16:41:39] thanks. Already found it though, now running git pull. 
Takes a while, didn't update puppet repo for quite some time :) [16:43:50] (03PS1) 10Jbond: postgresql::db_grant: use correct privlage [puppet] - 10https://gerrit.wikimedia.org/r/702157 [16:44:23] (03PS1) 10Urbanecm: deployment: Inactive motd implies the displayed name is an alias [puppet] - 10https://gerrit.wikimedia.org/r/702158 [16:44:26] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10sgrabarczuk) 05Open→03Resolved [16:44:29] (03CR) 10Jbond: [C: 03+2] postgresql::db_grant: use correct privlage [puppet] - 10https://gerrit.wikimedia.org/r/702157 (owner: 10Jbond) [16:44:30] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10sgrabarczuk) [16:44:32] legoktm: for your consideration ^ 🙂 [16:45:14] !log 1.37.0-wmf.12 was branched at 3703c3194b590a1fcccb485245022eac369d2b69 for T281153 [16:45:16] mail sent [16:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:23] T281153: 1.37.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T281153 [16:45:55] maybe the intention was to put "deployment.eqiad.wmnet" there? [16:46:23] it used to be there [16:46:31] (03CR) 10Legoktm: [C: 03+2] deployment: Inactive motd implies the displayed name is an alias [puppet] - 10https://gerrit.wikimedia.org/r/702158 (owner: 10Urbanecm) [16:46:43] (but afaik people complained about ssh keys not being seamless :D) [16:47:31] right [16:47:34] thank you :) [16:47:49] np [16:47:55] thanks for the merge [16:48:58] puppet ran, logged into deploy2002, looks good :) [16:54:16] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702159 [16:54:18] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702159 (owner: 10Brennen Bearnes) [16:54:58] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702159 (owner: 10Brennen Bearnes) [16:55:03] !log brennen@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.12 [16:55:07] !log T281327 `[Cirrus -> codfw]` Current banned nodes are`elastic2043` and `elastic2045`; `elastic2043` can be unbanned after a re-image, and `elastic2045` can be unbanned in ~30 minutes after shards rebalance (had heavy shards scheduled) [16:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:15] T281327: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 [16:57:12] 10SRE, 10SRE-OnFire, 10observability, 10Patch-For-Review: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (10CDanis) As a fun curiosity, here's today's datacenter switchover as shown by Statuspage: {F34531856} [16:57:17] (03PS1) 10Jbond: puppetdb::db_grant: fix sql unsless statments [puppet] - 10https://gerrit.wikimedia.org/r/702161 [16:57:18] 10SRE, 10SRE-OnFire, 10observability, 10Patch-For-Review: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (10CDanis) As a fun curiosity, here's today's datacenter switchover as shown by Statuspage: {F34531856} 
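(Referring back to the mwdebug exchange around 16:10-16:12: requests can be pinned to a specific debug backend with the X-Wikimedia-Debug header, but only backends in the active DC -- codfw at this point -- see writable databases, so edits through an eqiad backend would fail as read-only. A hedged illustration; the backend hostname here is an assumption for the example:)
    # send a single request through an assumed codfw debug backend and show the status code
    curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'X-Wikimedia-Debug: backend=mwdebug2001.codfw.wmnet' \
        'https://test.wikipedia.org/wiki/Special:BlankPage'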
[16:58:52] (03CR) 10Jbond: [C: 03+2] puppetdb::db_grant: fix sql unsless statments [puppet] - 10https://gerrit.wikimedia.org/r/702161 (owner: 10Jbond) [17:00:05] chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210629T1700) [17:00:25] A little follow from the DC switch: I created a short page on how to share tmux sessions, also including a section on how to fix the size. I'm far from being a tmux expert. So if somebody with more tmux knowledge would like to take a look I would be very grateful. https://wikitech.wikimedia.org/wiki/Collaborative_tmux_sessions [17:00:35] I will link the page in the Switch Datacenter page as well [17:00:54] (03PS6) 10CDanis: statograph: Initial commit [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) [17:01:54] jelto: oh that's awesome, thank you!! [17:05:09] 10SRE, 10SRE-OnFire, 10observability: SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10CDanis) [17:06:48] is "tmux sessions were shared" the reason why people said "terminal size" during the switchover from time to time? [17:07:07] yes [17:07:27] the way we had it setup was that the tmux would shrink to the size of the smallest window connected [17:08:01] but it seems going forward we can just set a window size [17:08:11] i see [17:08:15] nice idea either way [17:08:15] (03PS1) 10Jbond: postgresql::db_grants: quotes matter :S [puppet] - 10https://gerrit.wikimedia.org/r/702162 [17:08:23] (03PS1) 10Cwhite: kafkatee: send sampled-1000 webrequest logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/702163 [17:09:36] (03CR) 10Jbond: [C: 03+2] postgresql::db_grants: quotes matter :S [puppet] - 10https://gerrit.wikimedia.org/r/702162 (owner: 10Jbond) [17:13:49] 10SRE, 10SRE-OnFire, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis) >>! In T202061#7176114, @CDanis wrote: >>>! In T202061#7175886, @Legoktm wrote: >>>>! In T202061#7175872, @lmata wrote: >>> https://wikimedia.sta... [17:15:05] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for btullis - https://phabricator.wikimedia.org/T285754 (10herron) [17:16:05] (03PS1) 10Jbond: postgresql: use correct unless priv for function check [puppet] - 10https://gerrit.wikimedia.org/r/702165 [17:16:23] 10SRE, 10SRE-OnFire, 10observability, 10Patch-For-Review: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (10CDanis) To aid in the Puppetization and deployment: P16741 is a private-to-SRE paste that contains a configuration... 
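(As a rough companion to the collaborative-tmux page mentioned above, the basic sharing setup looks like this, assuming all operators attach as the same user on the shared host; the socket path and session name are illustrative, and the size options need tmux >= 2.9:)
    # first operator creates a session on a shared socket
    tmux -S /tmp/switchover.sock new -s switchover
    # the others attach to the same socket
    tmux -S /tmp/switchover.sock attach -t switchover
    # from inside the shared session: stop the layout shrinking to the smallest attached client
    tmux set-window-option -g window-size manual
    tmux resize-window -x 200 -y 50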
[17:18:27] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10Papaul) p:05Triage→03Medium a:03Papaul [17:19:49] (03CR) 10Jbond: [C: 03+1] "I havn;t had a chance to go over this again but as this is already functioning i suggest we merge this now so we can start parasitising a" [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [17:19:54] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Doing): Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (10thcipriani) p:05Triage→03Medium [17:20:05] (03CR) 10CDanis: [C: 03+2] statograph: Initial commit [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [17:20:27] oohhh setting the size of the window ahead of time, nice [17:20:42] (03CR) 10Krinkle: "How detailed should this intake be? afaik we usually don't expose general/unfiltered edge traffic in full detail except via stats-private-" [puppet] - 10https://gerrit.wikimedia.org/r/702163 (owner: 10Cwhite) [17:22:47] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for btullis - https://phabricator.wikimedia.org/T285754 (10herron) >>! In T285754#7184198, @Ottomata wrote: > @razzi will take care of this, and I will follow up with SRE on enabling root access after the initial acc... [17:22:48] (03CR) 10Jbond: [C: 03+2] postgresql: use correct unless priv for function check [puppet] - 10https://gerrit.wikimedia.org/r/702165 (owner: 10Jbond) [17:25:02] (03PS1) 10CDanis: Report upload_metrics failures in our exit code [software/statograph] - 10https://gerrit.wikimedia.org/r/702187 [17:25:38] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [17:26:54] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/statograph] - 10https://gerrit.wikimedia.org/r/702187 (owner: 10CDanis) [17:27:05] 10SRE, 10SRE-OnFire, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10Legoktm) >>! In T202061#7184427, @CDanis wrote: > After some more thought, I've removed "Developer tools" for now. Not only is there a lot to potentiall... [17:29:00] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for btullis - https://phabricator.wikimedia.org/T285754 (10Ottomata) Oh, and approved by me for analytics-privatedata-users. @herron if you have time to do this now I'm sure @razzi would not mind, we just wanted to... [17:31:20] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [17:32:00] (03PS2) 10CDanis: Report upload_metrics failures in our exit code [software/statograph] - 10https://gerrit.wikimedia.org/r/702187 (https://phabricator.wikimedia.org/T285569) [17:32:50] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10Papaul) Create Dispatch: Success You have successfully submitted request SR1063712714. 
[17:37:33] (03CR) 10CDanis: [C: 03+2] Report upload_metrics failures in our exit code [software/statograph] - 10https://gerrit.wikimedia.org/r/702187 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [17:38:46] (03Merged) 10jenkins-bot: Report upload_metrics failures in our exit code [software/statograph] - 10https://gerrit.wikimedia.org/r/702187 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [17:42:28] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/701936 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [17:48:34] 10SRE, 10SRE-OnFire, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10Legoktm) >>! In T202061#7176114, @CDanis wrote: > I was thinking we would have a status.wikimedia.org that serves a HTTP 302 to the other domain. I thin... [17:52:14] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.12 (duration: 57m 11s) [17:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1016.eqiad.... [17:56:16] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/701936 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [17:56:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) a:05Cmjohnson→03RobH irc update: chris said all the cables were swapped in reality, so had to change them around (port... 
[17:59:30] !log Start server-side upload of ~2.5G of JPG files (T282755) [17:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:37] T282755: Request of server-side-upload of iptc-metadata-added versions of existing files - https://phabricator.wikimedia.org/T282755 [18:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210629T1800) [18:00:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30041/console" [puppet] - 10https://gerrit.wikimedia.org/r/701936 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [18:05:58] RECOVERY - WDQS high update lag on wdqs1012 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.152e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:07:53] !log brennen@deploy1002 Pruned MediaWiki: 1.37.0-wmf.7 (duration: 04m 00s) [18:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:23] (03PS3) 10Jbond: puppetdb::app: Use seperate user for the read databse [puppet] - 10https://gerrit.wikimedia.org/r/701936 (https://phabricator.wikimedia.org/T285666) [18:09:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30042/console" [puppet] - 10https://gerrit.wikimedia.org/r/701936 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [18:09:49] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: REIMAGE [18:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 220, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:10:52] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for btullis - https://phabricator.wikimedia.org/T285754 (10herron) Sure I'll go ahead and prep a patch. I may have missed it, but what realname should be used for btullis? Would also be ideal to log a comment of... [18:11:04] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:11:59] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: REIMAGE [18:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:33] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for Ben Tullis - https://phabricator.wikimedia.org/T285754 (10Ottomata) [18:14:02] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for Ben Tullis - https://phabricator.wikimedia.org/T285754 (10Ottomata) Oh, Ben Tullis. Yes, we need approval from Ben's manager: @odimitrijevic Please approve! 
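(For readers wondering about the magic number in the purgeParserCache.php invocations above: --age is in seconds, and 1814400 seconds is exactly 21 days, so each run drops parser cache entries older than three weeks from the named tag, with --msleep adding a short pause, in milliseconds, between deletion batches.)
    # the --age value decoded: seconds in three weeks
    echo $((60 * 60 * 24 * 21))    # 1814400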
[18:14:19] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for Ben Tullis - https://phabricator.wikimedia.org/T285754 (10odimitrijevic) Approved [18:15:41] (03PS1) 10Herron: admin: create shell user btullis, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/702197 (https://phabricator.wikimedia.org/T285754) [18:19:53] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics cluster for Ben Tullis - https://phabricator.wikimedia.org/T285754 (10herron) [18:21:35] !log krinkle@mwmaint2002.codfw: mwscript purgeParserCache.php --wiki=aawiki --age=1814400 --msleep 200 --tag pc1 [18:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:25:26] (03CR) 10Razzi: [C: 03+1] admin: create shell user btullis, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/702197 (https://phabricator.wikimedia.org/T285754) (owner: 10Herron) [18:25:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:27:02] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:28:47] !log krinkle@mwmaint2002.codfw: mwscript purgeParserCache.php --wiki=aawiki --age=1814400 --msleep 200 --tag pc2 [18:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:30] 10SRE, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review, 10User-jbond: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10jbond) 05Open→03Resolved This is in place now [18:31:58] my server-side upload job just started to print stuff like `Dyke*Line_auf_dem_Jungfernstieg_und_neuen_Jungfernstieg_und_auf_Booten_auf_der_Binnenalster_350.jpg exists, overwriting...failed. (The file "mwstore://local-multiwrite/local-public/8/89/Dyke*Line_auf_dem_Jungfernstieg_und_neuen_Jungfernstieg_und_auf_Booten_auf_der_Binnenalster_350.jpg" is in an inconsistent state within the internal storage backends)` [18:32:23] does anyone have any idea what "inconsistent state within the internal storage backends" refers to? 
[18:32:54] 10Puppet, 10Infrastructure-Foundations, 10Packaging, 10User-jbond: Explore packaging facter 4.0 - https://phabricator.wikimedia.org/T285043 (10jbond) 05Open→03Resolved I think this is closed, packaging facter 4.0 seems like it should be fairly trivial [18:33:04] (job stopped, of course) [18:33:35] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet CI should use rspec-parallel - https://phabricator.wikimedia.org/T284080 (10jbond) 05Open→03Resolved a:03jbond this has been added now [18:34:30] !log krinkle@mwmaint2002.codfw: mwscript purgeParserCache.php --wiki=aawiki --age=1814400 --msleep 200 --tag pc3 [18:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:33] (03PS1) 10RobH: cloudcephosd1016 insetup role [puppet] - 10https://gerrit.wikimedia.org/r/702199 (https://phabricator.wikimedia.org/T274945) [18:35:57] (03CR) 10Razzi: "I like having the motd, especially since kerberos errors are so verbose and cryptic... that's its own issue perhaps we can tackle another " [puppet] - 10https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [18:36:44] (03PS2) 10RobH: cloudcephosd1016 insetup role [puppet] - 10https://gerrit.wikimedia.org/r/702199 (https://phabricator.wikimedia.org/T274945) [18:37:03] (03CR) 10RobH: [C: 03+2] cloudcephosd1016 insetup role [puppet] - 10https://gerrit.wikimedia.org/r/702199 (https://phabricator.wikimedia.org/T274945) (owner: 10RobH) [18:38:56] (03PS4) 10Zabe: flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) [18:39:51] (03CR) 10Herron: [C: 03+2] admin: create shell user btullis, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/702197 (https://phabricator.wikimedia.org/T285754) (owner: 10Herron) [18:40:32] (03PS5) 10Zabe: flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) [18:40:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` clo... [18:43:16] (03PS6) 10Zabe: flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) [18:44:54] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:46:48] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:52:12] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:53:02] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics cluster for Ben Tullis - https://phabricator.wikimedia.org/T285754 (10herron) Shell account has been created, and ldap account has been added to group `wmf` [18:53:04] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:55:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:55:35] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: REIMAGE [18:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:45] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: REIMAGE [18:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:00:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` clo... [19:00:05] brennen and marxarelli: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210629T1900). [19:00:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1017.eqiad.wmnet'] ` Of which those... [19:00:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` clo... [19:00:20] brennen: btw beta's down with a fatal [19:00:26] dunno if beta specific, but if not, might be a blocker [19:02:14] urbanecm: https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Main_Page&mobileaction=toggle_view_desktop&useskin=minervanueue works for me [19:02:29] RhinosF1: https://ar.wikipedia.beta.wmflabs.org/ does not [19:02:35] neither does https://cs.wikipedia.beta.wmflabs.org/ [19:02:42] hrm [19:02:51] urbanecm: T285345 [19:02:52] T285345: Python3 scap breaks mediawiki - https://phabricator.wikimedia.org/T285345 [19:03:03] urbanecm: that's beta specific [19:03:12] i wasn't sure, that's why i ask :) [19:03:15] Looks like majavah beat me [19:03:46] urbanecm: ok, proceeding. thanks for checking though. 
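(One quick way to sanity-check a promotion like the group0 rollout above is to ask a group0 wiki which branch it is actually serving; a hedged example against testwiki:)
    # the siteinfo "generator" field carries the running MediaWiki version
    curl -s 'https://test.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json' \
        | jq -r '.query.general.generator'
    # expected after this promotion: MediaWiki 1.37.0-wmf.12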
[19:04:33] (03PS1) 10Brennen Bearnes: group0 wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702213 [19:04:34] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702213 (owner: 10Brennen Bearnes) [19:04:41] brennen: yeah, that was an outstanding issue last week as well. annoying but not a blocker [19:04:48] also something i'll try to poke at this week [19:04:53] cool [19:05:49] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702213 (owner: 10Brennen Bearnes) [19:07:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:07:10] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.12 [19:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:30] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for TChin - https://phabricator.wikimedia.org/T285326 (10MNadrofsky) Approved! :-) [19:08:30] * legoktm is around in case any switchover-related stuff pops up [19:09:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1016.eqiad.wmnet'] ` and were **ALL*... [19:09:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:10:21] quiet so far (as expected/hoped for group0). [19:11:25] * urbanecm hopes all groups go without errors :) [19:11:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1018.eqiad.wm... [19:11:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1018.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudc... [19:12:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1018.eqiad.wm... 
[19:13:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [19:14:49] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1017.eqiad.wmnet with reason: REIMAGE [19:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:18] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1017.eqiad.wmnet with reason: REIMAGE [19:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1018.eqiad.wm... [19:18:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1018.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudc... [19:18:10] 10SRE, 10DBA, 10Datacenter-Switchover: When switching DCs, update pc hosts in tendril - https://phabricator.wikimedia.org/T266723 (10Legoktm) I've documented this as a manual step going forward: https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter&type=revision&diff=1917204&oldid=1917111 [19:19:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1019.eqiad.wm... [19:24:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:25:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:25:47] 10SRE, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review, 10User-jbond: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10faidon) Thank you @jbond for picking this up and sheperding it - appreciate it! [19:26:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1017.eqiad.wmnet'] ` and were **ALL** successful. [19:26:30] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1018.eqiad.wmnet with reason: REIMAGE [19:26:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1020.eqiad.wm... 
[19:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:17] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:28:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [19:28:14] (03PS1) 10Bartosz Dziewoński: Config option to enable topic subscriptions backend and dtenable=1 URL parameter [extensions/DiscussionTools] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702171 (https://phabricator.wikimedia.org/T284491) [19:28:27] (03PS1) 10Bartosz Dziewoński: Config option to enable topic subscriptions backend and dtenable=1 URL parameter [extensions/DiscussionTools] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702172 (https://phabricator.wikimedia.org/T284491) [19:28:44] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1018.eqiad.wmnet with reason: REIMAGE [19:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:54] (03PS1) 10Faidon Liambotis: Use allowlist/blocklist instead of whitelist/blacklist [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/702217 [19:28:56] (03PS1) 10Faidon Liambotis: Fix the wording on some of the reports output [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/702218 [19:33:14] 10SRE, 10ops-eqiad: Degraded RAID on cloudcephosd1018 - https://phabricator.wikimedia.org/T285799 (10ops-monitoring-bot) [19:33:38] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1019.eqiad.wmnet with reason: REIMAGE [19:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:50] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1019.eqiad.wmnet with reason: REIMAGE [19:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:22] 10SRE, 10Datacenter-Switchover: switchdc cookbook should perform exponential backoff when checking DNS TTL - https://phabricator.wikimedia.org/T285800 (10Legoktm) [19:38:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1018.eqiad.wmnet'] ` and were **ALL** successful. [19:39:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [19:41:04] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1020.eqiad.wmnet with reason: REIMAGE [19:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:14] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1020.eqiad.wmnet with reason: REIMAGE [19:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:08] dcausse: just to verify, you concluded that the search issues were independent of the DC switchover? 
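(T285800, filed above, asks for exponential backoff when the switchdc cookbook polls DNS TTLs. A generic shell sketch of that retry shape -- not the actual cookbook code; check_dns_ttl is a hypothetical placeholder for whatever the cookbook verifies:)
    attempt=0
    until check_dns_ttl; do                 # hypothetical check, returns 0 once the TTL looks right
        sleep $((2 ** attempt))             # wait 1s, 2s, 4s, 8s, ...
        attempt=$((attempt + 1))
        if [ "$attempt" -gt 8 ]; then
            echo "DNS TTL never converged, giving up" >&2
            exit 1
        fi
    done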
[19:44:17] (03CR) 10Ottomata: [C: 03+2] camus events - change which DC topic is used for Hadoop ingestion alerts [puppet] - 10https://gerrit.wikimedia.org/r/702129 (https://phabricator.wikimedia.org/T266798) (owner: 10Ottomata) [19:45:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1019.eqiad.wmnet'] ` and were **ALL** successful. [19:45:05] 10SRE, 10DBA, 10Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10Legoktm) [19:48:19] 10SRE, 10Datacenter-Switchover: switchdc check on mwmaint for running PHP processes should ignore php-fpm processes - https://phabricator.wikimedia.org/T285804 (10Legoktm) [19:49:03] 10SRE, 10Datacenter-Switchover: switchdc check on mwmaint for running PHP processes should ignore php-fpm processes - https://phabricator.wikimedia.org/T285804 (10Legoktm) [19:51:07] (03PS1) 10TrainBranchBot: Update train-versions.json [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/702223 [19:51:09] (03CR) 10TrainBranchBot: [C: 03+2] Update train-versions.json [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/702223 (owner: 10TrainBranchBot) [19:53:01] (03Merged) 10jenkins-bot: Update train-versions.json [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/702223 (owner: 10TrainBranchBot) [19:53:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1020.eqiad.wmnet'] ` and were **ALL** successful. [19:55:38] is it just me or the master of core is broken? 
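(For T285804 above: the mwmaint pre-switchover check currently trips on php-fpm pool workers, while the intent is to count only ad-hoc PHP maintenance scripts. A hedged one-liner in that spirit, not the cookbook's actual implementation:)
    # list PHP processes on the maintenance host, ignoring the php-fpm master and its workers
    pgrep -af php | grep -v 'php-fpm' || echo "no stray PHP maintenance scripts running"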
[19:56:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [19:56:24] nvm [19:56:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) 05Open→03Resolved @Andrew these are now ready for your use [19:59:23] (03PS1) 10Herron: add tchin to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/702224 (https://phabricator.wikimedia.org/T285326) [19:59:50] (03PS1) 10Ahmon Dancy: Add empty train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702225 (https://phabricator.wikimedia.org/T282824) [20:00:13] (03CR) 10Ahmon Dancy: [C: 03+2] Add empty train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702225 (https://phabricator.wikimedia.org/T282824) (owner: 10Ahmon Dancy) [20:01:48] 10SRE, 10Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (10Legoktm) [20:03:09] (03Merged) 10jenkins-bot: Add empty train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702225 (https://phabricator.wikimedia.org/T282824) (owner: 10Ahmon Dancy) [20:04:30] (03PS1) 10TrainBranchBot: Update train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702226 [20:04:32] (03CR) 10TrainBranchBot: [C: 03+2] Update train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702226 (owner: 10TrainBranchBot) [20:07:26] (03Merged) 10jenkins-bot: Update train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702226 (owner: 10TrainBranchBot) [20:16:39] (03PS1) 10Cwhite: admin: add cwhite to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/702229 [20:18:22] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:20:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:26:26] !log Reverting to scap 3.17.1-1+0~20210419163335.8~1.gbpa6b2e0 in beta [20:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:45] 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10RobH) [20:38:28] 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10RobH) [20:39:11] PROBLEM - Thanos compact has high percentage of failures on alert1001 is CRITICAL: job=thanos-compact https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [20:40:36] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for fgoodwin - https://phabricator.wikimedia.org/T285580 (10MNadrofsky) Approved. [20:40:39] RECOVERY - Thanos compact has high percentage of failures on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [20:47:32] (03CR) 10Cwhite: "Thanks for having a look! These are good questions." 
[puppet] - 10https://gerrit.wikimedia.org/r/702163 (owner: 10Cwhite) [20:54:41] (03CR) 10Ladsgroup: "Thanks but I don't have +2 rights to merge it. Can you do it when you have time?" [puppet] - 10https://gerrit.wikimedia.org/r/702145 (owner: 10Ladsgroup) [20:55:02] !log Deleted all CDB files on beta so they'll be recreated on the next scap sync-world run [20:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:25] 10ops-codfw, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (10RobH) [20:59:14] 10ops-codfw, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (10RobH) [21:03:37] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:05:04] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1018 - https://phabricator.wikimedia.org/T285799 (10Peachey88) [21:07:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:08:34] 10SRE, 10SRE-OnFire, 10observability: Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10Peachey88) [21:09:13] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:28] Hey all - the secteam is deploying two security patches right now for T285190 and T285515. Unless we should hold for any reason. [21:19:17] !log Deployed updated security patch for T285190 to wmf.11 [21:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:29] noticing a rise in mediawiki-errors in logstash, investigating further [21:26:47] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:27:30] planning to revert the changes, errors seem related to the newly deployed security patch [21:28:03] sbassett: ^ [21:28:50] RhinosF1: yep, well aware. Revert going out for wmf.11 now. wmf.12 soon. [21:29:53] !log Reverted and deployed updated security patch for T285190 to wmf.11 [21:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:35] Didn't see maryum was sec team until I searched them on phab. [21:30:38] I guess welcome [21:31:58] !log Reverted and deployed updated security patch for T285190 to wmf.12 [21:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:38:49] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:39:21] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:40:02] 10SRE, 10ops-eqiad, 10SRE-swift-storage, 10User-fgiunchedi: Decom ms-be[1019-1026] - https://phabricator.wikimedia.org/T272836 (10Jclark-ctr) [21:40:31] 10SRE, 10ops-eqiad, 10SRE-swift-storage, 10User-fgiunchedi: Decom ms-be[1019-1026] - https://phabricator.wikimedia.org/T272836 (10Jclark-ctr) Removed ms-be1026 to rack new host [21:40:45] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:42:19] !log deployed updated security patch for T285190 to wmf.11 [21:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:37] !log deployed updated security patch for T285190 to wmf.12 [21:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:15] !log deployed security patch T285515 to wmf.11 [21:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:59] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:58:03] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:58:20] !log deployed security patch T285515 to wmf.12 [21:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:53] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [22:07:41] (03PS1) 10Urbanecm: SpecialEditGrowthConfig: Do not use relative => true [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702173 (https://phabricator.wikimedia.org/T285750) [22:07:56] (03PS1) 10Urbanecm: SpecialEditGrowthConfig: Do not use relative => true [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702174 (https://phabricator.wikimedia.org/T285750) [22:19:09] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:20:48] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10Jclark-ctr) backup1004. A4 u9 port1 Cableid#5320 backup1005. B4 u27 port11 Cableid#5351 backup1006. C2 U15 port21 Cableid#6011 backup1007. D7 U13 port12 C... 
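A note on the two GrowthExperiments patches just above: the same fix has to land on every wmf branch currently serving traffic, which is why it appears twice, once for wmf/1.37.0-wmf.11 and once for wmf/1.37.0-wmf.12. A rough sketch of how such a pair of backports can be prepared by hand (FIX_SHA is a placeholder for the master commit being backported; in practice Gerrit's cherry-pick button or helper tooling usually does this):

    # Cherry-pick one fix onto both deployed wmf branches and upload each
    # result to Gerrit for review via refs/for/<branch>.
    FIX_SHA=0123abc   # placeholder: the master commit being backported
    for branch in wmf/1.37.0-wmf.11 wmf/1.37.0-wmf.12; do
      git fetch origin "$branch"
      git checkout -B "backport-${branch##*/}" "origin/$branch"
      git cherry-pick "$FIX_SHA"
      git push origin "HEAD:refs/for/$branch"
    done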
[22:21:00] (03CR) 10Bstorm: [C: 03+2] "I've disabled puppet briefly on labstore1006 so I can try this safely on 1007 😊" [puppet] - 10https://gerrit.wikimedia.org/r/698976 (https://phabricator.wikimedia.org/T164454) (owner: 10Muehlenhoff) [22:23:28] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10Jclark-ctr) [22:23:55] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [22:28:38] (03CR) 10Razzi: [C: 03+2] "I'll merge this!" [puppet] - 10https://gerrit.wikimedia.org/r/702145 (owner: 10Ladsgroup) [22:29:36] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): labstore1007 crashed after storage controller errors--replace disk? - https://phabricator.wikimedia.org/T281045 (10Jclark-ctr) Replaced drive bay5 [22:29:45] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [22:31:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:31:51] (03CR) 10Bstorm: "So, this hasn't had any affect. The puppetization seems to just be happy that nginx is installed. It will cause a change when rebuilds hap" [puppet] - 10https://gerrit.wikimedia.org/r/698976 (https://phabricator.wikimedia.org/T164454) (owner: 10Muehlenhoff) [22:31:58] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [22:33:20] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): labstore1007 crashed after storage controller errors--replace disk? - https://phabricator.wikimedia.org/T281045 (10Bstorm) physicaldrive 1I:1:5 (port 1I:box 1:bay 5, 6001.1 GB): Rebuilding Looking good. [22:39:47] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:40:26] (03CR) 10Bstorm: "Well...no the package is changed. It just seems to still have the modules on disk. That seems weird to me." [puppet] - 10https://gerrit.wikimedia.org/r/698976 (https://phabricator.wikimedia.org/T164454) (owner: 10Muehlenhoff) [22:40:52] 10SRE, 10ops-eqiad, 10Discovery, 10Discovery-Search (Current work): Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T285643 (10wiki_willy) a:03Cmjohnson [22:41:09] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): labstore1007 crashed after storage controller errors--replace disk? - https://phabricator.wikimedia.org/T281045 (10Jclark-ctr) [22:41:37] 10SRE, 10ops-eqiad: Disk failed on thanos-be1003 - https://phabricator.wikimedia.org/T285664 (10wiki_willy) a:03Cmjohnson [22:41:43] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1018 - https://phabricator.wikimedia.org/T285799 (10wiki_willy) a:03Cmjohnson [22:42:51] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:43:35] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:44:29] 10SRE, 10ops-eqiad: Disk failed on thanos-be1003 - https://phabricator.wikimedia.org/T285664 (10Jclark-ctr) @wiki_willy @fgiunched looks healthy no visibly failed drives [22:44:58] (03CR) 10Bstorm: "Anyway, this is done. On rebuild it might look somewhat different, and that is sooner than later." [puppet] - 10https://gerrit.wikimedia.org/r/698976 (https://phabricator.wikimedia.org/T164454) (owner: 10Muehlenhoff) [22:50:31] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:57:03] (03PS2) 10Zabe: Avoid using MWNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697851 [22:59:50] jouncebot: now [22:59:50] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [22:59:52] jouncebot: next [22:59:53] In 0 hour(s) and 0 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210629T2300) [23:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210629T2300) [23:00:05] MatmaRex, Urbanecm, and zabe: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:08] (03CR) 10Urbanecm: [C: 03+2] SpecialEditGrowthConfig: Do not use relative => true [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702174 (https://phabricator.wikimedia.org/T285750) (owner: 10Urbanecm) [23:00:12] (03CR) 10Urbanecm: [C: 03+2] SpecialEditGrowthConfig: Do not use relative => true [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702173 (https://phabricator.wikimedia.org/T285750) (owner: 10Urbanecm) [23:00:13] i can deploy today [23:00:15] hellloooo [23:00:19] o/ [23:00:22] hey MatmaRex [23:00:25] and zabe :) [23:00:29] (03CR) 10Urbanecm: [C: 03+2] Config option to enable topic subscriptions backend and dtenable=1 URL parameter [extensions/DiscussionTools] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702171 (https://phabricator.wikimedia.org/T284491) (owner: 10Bartosz Dziewoński) [23:00:33] (03CR) 10Urbanecm: [C: 03+2] Config option to enable topic subscriptions backend and dtenable=1 URL parameter [extensions/DiscussionTools] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702172 (https://phabricator.wikimedia.org/T284491) (owner: 10Bartosz Dziewoński) [23:01:43] zabe: hello, ad https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/697851, would it be possible to deploy this at a time when WMDE folks are more likely to be available (ideally, the EU window)? 
[23:02:50] they are the experts on that one, so sure, I can do that with them being around [23:03:14] (removed that one from the calander) [23:03:17] thanks [23:03:26] (03CR) 10Urbanecm: [C: 04-1] "minor comment, otherwise LGTM" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [23:03:36] and please have a look at this comment,too :) [23:08:07] (03PS7) 10Zabe: flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) [23:09:09] (03CR) 10Urbanecm: [C: 03+2] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [23:10:06] (03Merged) 10jenkins-bot: flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [23:11:21] meh... [23:11:29] someone updated train-versions.json and did not update deployment host [23:11:41] dancy: ^^ is that safe to ignore? [23:16:09] urbanecm: guessing so [23:16:41] I'd rather not guess when it comes to updating production :/ [23:17:33] urbanecm: yeah, one second while i dig into this a bit. d.ancy is off for the evening. [23:17:49] thanks brennen [23:21:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:23:30] (03Merged) 10jenkins-bot: SpecialEditGrowthConfig: Do not use relative => true [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702174 (https://phabricator.wikimedia.org/T285750) (owner: 10Urbanecm) [23:23:32] (03Merged) 10jenkins-bot: SpecialEditGrowthConfig: Do not use relative => true [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702173 (https://phabricator.wikimedia.org/T285750) (owner: 10Urbanecm) [23:23:35] (03Merged) 10jenkins-bot: Config option to enable topic subscriptions backend and dtenable=1 URL parameter [extensions/DiscussionTools] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702171 (https://phabricator.wikimedia.org/T284491) (owner: 10Bartosz Dziewoński) [23:23:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:23:37] (03Merged) 10jenkins-bot: Config option to enable topic subscriptions backend and dtenable=1 URL parameter [extensions/DiscussionTools] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702172 (https://phabricator.wikimedia.org/T284491) (owner: 10Bartosz Dziewoński) [23:23:39] urbanecm: this is part of work on T282824; it's referenced basically nowhere i can find and is a very recent addition to mw/config. i think you're safe to proceed. 
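The concern in the last few messages is that mediawiki-config's master now holds a merged commit (the bot's train-versions.json update) that was never pulled onto the deployment host and synced, so what is live lags behind what Gerrit shows as merged. One quick way a deployer can confirm that, assuming the staging checkout on deploy1002 sits at /srv/mediawiki-staging (the customary location, treated here as an assumption):

    # Show commits that are merged in Gerrit but not yet on the deploy host.
    cd /srv/mediawiki-staging
    git fetch origin
    git log --oneline HEAD..origin/master
    # Anything listed will go out with the next pull and scap sync,
    # whether or not the person syncing intended to deploy it.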
[23:23:40] T282824: MW container image build workflow vs docker-registry caching - https://phabricator.wikimedia.org/T282824 [23:24:31] brennen: thanks for looking into this [23:25:50] zabe: MatmaRex: commits pulled onto mwdebug2001 (note it's in codfw now) [23:25:55] (03CR) 10Bstorm: d/changelog: Prepare for 0.75 release (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm) [23:27:09] urbanecm: looks good, i subscribed to https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)?dtenable=1#Reply_tool,_for_coders :o [23:27:21] MatmaRex: thanks, syncing [23:27:33] my patch also lgtm, syncing [23:28:32] urbanecm: mine contains a small mistake (I forgot to remove the one that sysops can add to themselves, which is unnecessary when they can add flood to any account). Would you rather revert or add a small fixing patch? [23:28:47] zabe: definitely upload a followup :) [23:28:58] ok, doing [23:30:18] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/DiscussionTools/: e77e002130a7815570ff967013757bacc7037fb0: Config option to enable topic subscriptions backend and dtenable=1 URL parameter (T284491) (duration: 01m 09s) [23:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:27] T284491: Make topic subscriptions available via URL parameter - https://phabricator.wikimedia.org/T284491 [23:31:40] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/DiscussionTools/: bad82665f8bea667aff049794612c270063c7519: Config option to enable topic subscriptions backend and dtenable=1 URL parameter (T284491) (duration: 01m 06s) [23:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:00] one more sync, as i synced the same patch twice... [23:32:43] i was just going to say :D [23:33:06] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/DiscussionTools/: bad82665f8bea667aff049794612c270063c7519: Config option to enable topic subscriptions backend and dtenable=1 URL parameter (T284491) (duration: 01m 05s) [23:33:09] it shouldn't hurt to run this thing multiple times, at least in theory :D [23:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:14] MatmaRex: live for real now [23:34:28] urbanecm: can you double-check that the wmf.12 branch has the wmf.12 patch? it looks like the wmf.11 patch was synchronized [23:34:38] unless that was a copy-paste mistake, i don't know if these messages are automatic? [23:34:50] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/GrowthExperiments/includes/Specials/SpecialEditGrowthConfig.php: c61fb175c82accda526105cc32457b07530d09fa: SpecialEditGrowthConfig: Do not use relative => true (T285750) (duration: 01m 05s) [23:34:54] the path is automatic, the justification is manual [23:34:54] (03PS3) 10Zabe: Remove ability for sysops to add themself to flood on enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702175 (https://phabricator.wikimedia.org/T285594) [23:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:56] (i mean, it looks that way in the log messages) [23:34:57] T285750: Do not use relative => true until autocomplete bug is fixed - https://phabricator.wikimedia.org/T285750 [23:35:01] RECOVERY - Device not healthy -SMART- on labstore1007 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops [23:35:02] let me check [23:35:09] urbanecm: ok, that makes sense, thanks [23:36:21] (DiscussionTools's wmf.11 and wmf.12 actually only differ by some localisation messages) [23:36:35] (03PS1) 10TrainBranchBot: Update train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702247 [23:36:37] (03CR) 10TrainBranchBot: [C: 03+2] Update train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702247 (owner: 10TrainBranchBot) [23:36:46] what [23:36:52] a bot that automatically CR+2 in config? [23:37:06] (03CR) 10Urbanecm: [C: 04-2] "this will cause undeployed patch in deployment, and runs during window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702247 (owner: 10TrainBranchBot) [23:37:11] brennen: ^^ [23:38:00] MatmaRex: The config variable is there https://www.irccloud.com/pastebin/STof7RjB/ [23:38:08] and it looks it was added by the backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/702172 [23:38:24] that bot already did that once (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/702226/) [23:38:31] yeah. thanks [23:38:45] urbanecm: I uploaded a follow-up patch ^. I also added it to the calander. [23:38:53] any time MatmaRex [23:39:36] that bot's not supposed to CR+2 in config at all (unless it also ssh'es there and deploys the JSON) [23:40:04] (03CR) 10Urbanecm: [C: 03+2] Remove ability for sysops to add themself to flood on enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702175 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [23:40:26] twentyafterfour: looks like your bot ^^^ [23:40:38] (according to its email in gerrit) [23:40:46] (03Merged) 10jenkins-bot: Remove ability for sysops to add themself to flood on enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702175 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [23:40:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:41:14] (I'll fetch to mwdebug once my second sync for my backport completes) [23:41:16] we've automatically +2'd the train branch for a while now, but this is a bit different. [23:41:39] train branch is fine (as by the time that job runs, it's not even fetched on deployment machine) [23:41:43] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/GrowthExperiments/includes/Specials/SpecialEditGrowthConfig.php: 8a5b835050cc0f6d47b6fde2317db8743fcb9ce0: SpecialEditGrowthConfig: Do not use relative => true (T285750) (duration: 01m 04s) [23:41:48] this one will confuse deployers [23:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:50] T285750: Do not use relative => true until autocomplete bug is fixed - https://phabricator.wikimedia.org/T285750 [23:42:00] brennen: could you disable that job perhaps? 
[23:42:32] (and cause undeployed code as soon as the json is used anywhere) [23:42:47] zabe: your followup is at mwdebug2001, please test [23:42:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:43:26] urbanecm: yes, now it works the way it's supposed to [23:43:30] great, syncing [23:44:13] (03PS1) 10TrainBranchBot: Update train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702248 [23:44:15] (03CR) 10TrainBranchBot: [C: 03+2] Update train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702248 (owner: 10TrainBranchBot) [23:44:28] (03CR) 10Urbanecm: [C: 04-2] "no, this won't happen during deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702248 (owner: 10TrainBranchBot) [23:45:10] !log Remove TrainBranchBot from wmf-deployment Gerrit group, it merges code to mediawiki-config without actually deploying it [23:45:14] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 367bc98: 904d18720: flood flag changes for enwikibooks (T285594) (duration: 01m 07s) [23:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:21] T285594: Pseudobot changes for enwikibooks - https://phabricator.wikimedia.org/T285594 [23:45:26] zabe: should be live [23:45:31] anything else? [23:45:44] no, thanks for your help :) [23:45:49] any time :) [23:45:55] !log Evening B&C window done [23:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:03] (I'll file the train branch bot thing soon) [23:50:11] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:53:10] urbanecm: i've got to run, but please comment on T282824. I imagine we can resolve it in the US morning. [23:53:11] T282824: MW container image build workflow vs docker-registry caching - https://phabricator.wikimedia.org/T282824 [23:53:25] ack [23:53:54] (03CR) 10Bstorm: "This has sat around a while. Since it is almost always possible to roll back, I will probably merge this tomorrow in my day along with the" [puppet] - 10https://gerrit.wikimedia.org/r/684100 (owner: 10Majavah) [23:57:25] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
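For completeness on the "please test" steps in this window: staged code on mwdebug2001 is verified by pinning individual requests to that host before syncing everywhere, normally via the WikimediaDebug browser extension, which sets the X-Wikimedia-Debug header. A command-line equivalent might look like the sketch below; the header uses the commonly documented backend= form and the page is just an arbitrary example:

    # Send one request through mwdebug2001 (codfw) to check a staged patch.
    curl -s -H 'X-Wikimedia-Debug: backend=mwdebug2001.codfw.wmnet' \
      'https://en.wikipedia.org/wiki/Special:Version' \
      | grep -Eo '1\.37\.0-wmf\.[0-9]+' | head -n1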