[00:02:54] (PS1) Urbanecm: Growth mentor dashboard: Enable on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/713370 (https://phabricator.wikimedia.org/T278920)
[00:04:08] (CR) Urbanecm: [C: +2] Growth mentor dashboard: Enable on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/713370 (https://phabricator.wikimedia.org/T278920) (owner: Urbanecm)
[00:05:02] (Merged) jenkins-bot: Growth mentor dashboard: Enable on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/713370 (https://phabricator.wikimedia.org/T278920) (owner: Urbanecm)
[00:07:45] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: eccdd3ed3fda1abee9a4c57719afd0d1faae41c3: Growth mentor dashboard: Enable on testwiki (T278920) (duration: 00m 59s)
[00:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:53] T278920: Mentor dashboard: V1 desktop - https://phabricator.wikimedia.org/T278920
[00:11:03] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2059-production-search-omega-codfw on elastic2059 is OK: (C)100 gt (W)80 gt 60 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2059&panelId=37
[00:12:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:38] SRE, Wikimedia-Apache-configuration, serviceops: Investigate and restore K.A.Z httpbb test - https://phabricator.wikimedia.org/T289022 (RLazarus)
[01:04:53] (PS1) RLazarus: httpbb: Remove the failing K.A.Z test pending investigation. [puppet] - https://gerrit.wikimedia.org/r/713375 (https://phabricator.wikimedia.org/T289022)
[02:00:04] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T0200)
[02:04:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:47] (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.19 [core] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/713380
[02:06:49] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.19 [core] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/713380 (owner: TrainBranchBot)
[02:29:06] (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.19 [core] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/713380 (owner: TrainBranchBot)
[02:37:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:52:28] (CR) Andrew Bogott: [C: +2] logstash: forward nova-fullstack logs to logstash [puppet] - https://gerrit.wikimedia.org/r/713323 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[03:34:07] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100%
[03:36:33] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[03:45:24] (Abandoned) Andrew Bogott: nova-fullstack: send logs to ELK [puppet] - https://gerrit.wikimedia.org/r/713006 (owner: Andrew Bogott)
[04:14:37] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:27:23] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[04:39:19] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[04:46:29] (PS1) Marostegui: dbproxy2004: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/713384 (https://phabricator.wikimedia.org/T288093)
[04:47:36] (CR) Juan90264: [C: +1] Use Wikimania's logo in a new vector [mediawiki-config] - https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[04:51:13] (CR) Marostegui: [C: +2] dbproxy2004: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/713384 (https://phabricator.wikimedia.org/T288093) (owner: Marostegui)
[05:07:30] SRE, Anti-Harassment, Traffic: Enable automatic redirection to the mobile version of votewiki - https://phabricator.wikimedia.org/T288938 (phuedx) Many thanks to @ssingh and #traffic for the very quick turnaround. >>! In T288938#7284430, @phuedx wrote: > @dom_walden noted that after he'd jumped from...
[05:10:24] (PS1) Tim Starling: Make DBStore be able to load election properties from foreign wikis [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713349
[05:15:29] (CR) Tim Starling: [C: +2] Make DBStore be able to load election properties from foreign wikis [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713349 (owner: Tim Starling)
[05:20:31] (Merged) jenkins-bot: Make DBStore be able to load election properties from foreign wikis [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713349 (owner: Tim Starling)
[05:21:57] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:22:08] (PS1) Tim Starling: New CLI scripts makeMailingList.php and deduplicateMailingList.php [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713350
[05:22:22] (CR) Tim Starling: [C: +2] New CLI scripts makeMailingList.php and deduplicateMailingList.php [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713350 (owner: Tim Starling)
[05:22:35] (PS1) Tim Starling: Add mail sending script [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713351
[05:22:41] (CR) Tim Starling: [C: +2] Add mail sending script [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713351 (owner: Tim Starling)
[05:23:01] (PS1) Tim Starling: Add property mobile-jump-url [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713352 (https://phabricator.wikimedia.org/T289016)
[05:23:16] (PS1) Tim Starling: Add script importGlobalVoterList.php [extensions/SecurePoll] (wmf/1.37.0-wmf.18) -
https://gerrit.wikimedia.org/r/713353
[05:23:24] (CR) Tim Starling: [C: +2] Add property mobile-jump-url [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713352 (https://phabricator.wikimedia.org/T289016) (owner: Tim Starling)
[05:23:28] (CR) Tim Starling: [C: +2] Add script importGlobalVoterList.php [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713353 (owner: Tim Starling)
[05:23:59] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 104 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:26:54] (Merged) jenkins-bot: New CLI scripts makeMailingList.php and deduplicateMailingList.php [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713350 (owner: Tim Starling)
[05:27:00] (Merged) jenkins-bot: Add mail sending script [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713351 (owner: Tim Starling)
[05:27:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:41] (Merged) jenkins-bot: Add property mobile-jump-url [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713352 (https://phabricator.wikimedia.org/T289016) (owner: Tim Starling)
[05:28:01] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:28:18] (Merged) jenkins-bot: Add script importGlobalVoterList.php [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713353 (owner: Tim Starling)
[05:28:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:05] (PS1) Tim Starling: Add property mobile-jump-url [extensions/SecurePoll] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/713354 (https://phabricator.wikimedia.org/T289016)
[05:29:21] (CR) Tim Starling: [C: +2] Add property mobile-jump-url [extensions/SecurePoll] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/713354 (https://phabricator.wikimedia.org/T289016) (owner: Tim Starling)
[05:30:27] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 47 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:33:25] !log tstarling@deploy1002 Started scap: collected SecurePoll maintenance scripts and bug fix
[05:33:30] (Merged) jenkins-bot: Add property mobile-jump-url [extensions/SecurePoll] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/713354 (https://phabricator.wikimedia.org/T289016) (owner: Tim Starling)
[05:33:30] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:37] !log tstarling@deploy1002 Finished scap: collected SecurePoll maintenance scripts and bug fix (duration: 04m 12s)
[05:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:34] !log foreachwikiindblist securepollglobal mysql.php --write -- -e 'insert into securepoll_properties (pr_entity,pr_key,pr_value) select el_entity,'\''mobile-jump-url'\'','\''https://vote.m.wikimedia.org/wiki/Special:SecurePoll'\'' from securepoll_elections where el_title='\''DWalden STV Election Test 456'\'' limit 1;'
[05:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:18] ops-eqdfw, Infrastructure-Foundations, netops: cr2-eqdfw: PEM 1 Not Powered - https://phabricator.wikimedia.org/T289028 (ayounsi)
[06:25:48] (PS1) MusikAnimal: Revert "Hide disambiguator-link-added tag temporarily" [extensions/Disambiguator] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/713355
[06:27:11] (PS2) MusikAnimal: Revert "Hide disambiguator-link-added tag temporarily" [extensions/Disambiguator] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/713355
[06:28:14] (PS1) Tim Starling: Filter encryption keys out of public dumps [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713356 (https://phabricator.wikimedia.org/T288924)
[06:28:42] (PS1) MusikAnimal: Apply the disambiguator-link-added tag in onRecentChange_save [extensions/Disambiguator] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/713357 (https://phabricator.wikimedia.org/T287549)
[06:29:14] (CR) Tim Starling: "I'll deploy this tomorrow. Just cherry picking now so I don't forget it." [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713356 (https://phabricator.wikimedia.org/T288924) (owner: Tim Starling)
[06:44:45] (PS1) ZPapierski: Test - verify task manager count impact [deployment-charts] - https://gerrit.wikimedia.org/r/713429
[06:46:58] (CR) Tim Starling: [C: +2] "I'm still awake so I may as well do it now."
[extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713356 (https://phabricator.wikimedia.org/T288924) (owner: Tim Starling)
[06:51:13] (Merged) jenkins-bot: Filter encryption keys out of public dumps [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713356 (https://phabricator.wikimedia.org/T288924) (owner: Tim Starling)
[06:54:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[06:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:50] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/SecurePoll/cli/dump.php: T288924 (duration: 00m 58s)
[06:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:58] T288924: Private keys visible to anonymous users in SecurePoll dump - https://phabricator.wikimedia.org/T288924
[06:56:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[06:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:04] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/SecurePoll/includes/Entities/Election.php: T288924 (duration: 00m 57s)
[06:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:15] (CR) Dzahn: [C: +1] "makes sense to me.
first remove, later investigate" [puppet] - https://gerrit.wikimedia.org/r/713375 (https://phabricator.wikimedia.org/T289022) (owner: RLazarus)
[07:39:38] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw144[7-9].eqiad.wmnet
[07:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:54] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw1450.eqiad.wmnet
[07:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:45] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Jelto)
[07:44:09] !log Drop aft_feedback tables on x1 T250715
[07:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:18] T250715: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715
[07:48:17] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw145[1-5].eqiad.wmnet
[07:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:46] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw145[1-5].eqiad.wmnet
[07:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:40] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw145[1-5].eqiad.wmnet
[07:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:48] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[07:52:08] !log mw1451 through mw1455 - fresh hardware pooled the first time as appservers
[07:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:38] SRE, ops-eqiad, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through
mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn) Stalled→Open
[07:54:49] (CR) Effie Mouzeli: [C: +2] Test - verify task manager count impact [deployment-charts] - https://gerrit.wikimedia.org/r/713429 (owner: ZPapierski)
[07:57:15] (Merged) jenkins-bot: Test - verify task manager count impact [deployment-charts] - https://gerrit.wikimedia.org/r/713429 (owner: ZPapierski)
[07:58:14] ACKNOWLEDGEMENT - Host backup1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T286625
[07:59:45] !log jelto@cumin1001 conftool action : set/weight=30; selector: name=mw1450.eqiad.wmnet
[07:59:49] !log mw1384 - start failed ferm service
[07:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:59] RECOVERY - mediawiki-installation DSH group on mw1452 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[07:59:59] RECOVERY - mediawiki-installation DSH group on mw1453 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[07:59:59] RECOVERY - mediawiki-installation DSH group on mw1454 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[07:59:59] RECOVERY - mediawiki-installation DSH group on mw1455 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[08:00:09] !log jelto@cumin1001 conftool action : set/weight=30; selector: name=mw144[7-9].eqiad.wmnet
[08:00:13] there we go
[08:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:16] (CR) Filippo Giunchedi: [C: +1] retire role::kafka::monitoring and kafkamon[12]001 [puppet] - https://gerrit.wikimedia.org/r/713307 (https://phabricator.wikimedia.org/T252773) (owner: Herron)
[08:00:27] RECOVERY - Check systemd state on mw1384 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:44] ACKNOWLEDGEMENT - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:06:22] !log zpapierski@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[08:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:00] SRE-tools, DBA, Infrastructure-Foundations, Spicerack, Datacenter-Switchover: switchdc should verify active/active DBs are read-write in both datacenters - https://phabricator.wikimedia.org/T287129 (LSobanski) @Legoktm since we are getting closer to the switch back, is this task a requirement...
[08:09:06] (PS3) Jelto: site/conftool: remove mw1276 through mw1279 [puppet] - https://gerrit.wikimedia.org/r/704287 (https://phabricator.wikimedia.org/T280203) (owner: Dzahn)
[08:12:05] (CR) Dzahn: [C: +2] site/conftool: remove mw1276 through mw1279 [puppet] - https://gerrit.wikimedia.org/r/704287 (https://phabricator.wikimedia.org/T280203) (owner: Dzahn)
[08:15:23] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw[1276-1279].eqiad.wmnet with reason: decom old appservers in eqiad T280203
[08:15:27] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw[1276-1279].eqiad.wmnet with reason: decom old appservers in eqiad T280203
[08:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:32] T280203: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203
[08:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:52] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw127[6-9].eqiad.wmnet
[08:17:58]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1456.eqiad.wmnet with reason: REIMAGE
[08:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:37] !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=mw127[6-9].eqiad.wmnet
[08:18:38] !log zpapierski@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[08:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:46] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1456.eqiad.wmnet with reason: REIMAGE
[08:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:45] !log mw2383 - scap pull (still depooled because T286463 but alerts in Icinga since a while)
[08:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:53] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463
[08:24:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1456.eqiad.wmnet with reason: new setup
[08:24:38] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1456.eqiad.wmnet with reason: new setup
[08:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:55] !log jelto@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1276-1279].eqiad.wmnet
[08:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:41] !log disable puppet on mediawiki hosts to merge 712920
[08:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:17]
RECOVERY - Ensure local MW versions match expected deployment on mw2383 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[08:32:31] (PS1) Ema: Add Varnish SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036)
[08:37:41] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1276-1279].eqiad.wmnet
[08:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:50] SRE, ops-eqiad, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw[1276-...
[08:39:13] SRE, ops-eqiad, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Jelto)
[08:39:48] (PS1) Dzahn: miscweb: add helmfile.yaml and values under services.d [deployment-charts] - https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538)
[08:41:39] SRE, Analytics, Patch-For-Review: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (BTullis) I think it's just this single host an-test-client1001 that's sending this daily logspam now, isn't it? To me, the issue looks like we should just be setting up `/home` as...
[08:47:23] (PS2) Dzahn: miscweb: add helmfile.yaml and values under services.d [deployment-charts] - https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538)
[08:49:09] (PS1) MVernon: disable notifications on db2121 for buster reimage [puppet] - https://gerrit.wikimedia.org/r/713442 (https://phabricator.wikimedia.org/T288244)
[08:51:12] (CR) Kormat: [C: -1] disable notifications on db2121 for buster reimage (1 comment) [puppet] - https://gerrit.wikimedia.org/r/713442 (https://phabricator.wikimedia.org/T288244) (owner: MVernon)
[08:55:02] (CR) Dzahn: "I am not sure how to describe this step in the commit message. Is there a better way to phrase it?" [deployment-charts] - https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[08:55:50] (PS2) MVernon: db2121: Disable notifications for buster reimage [puppet] - https://gerrit.wikimedia.org/r/713442 (https://phabricator.wikimedia.org/T288244)
[08:56:42] (CR) Kormat: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/713442 (https://phabricator.wikimedia.org/T288244) (owner: MVernon)
[08:57:35] (CR) MVernon: [C: +2] db2121: Disable notifications for buster reimage [puppet] - https://gerrit.wikimedia.org/r/713442 (https://phabricator.wikimedia.org/T288244) (owner: MVernon)
[09:00:38] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:08:55] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw1456.eqiad.wmnet
[09:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:02] !log reimaging db2121 to buster T288244
[09:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:09] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 -
https://phabricator.wikimedia.org/T288244
[09:16:38] (CR) Dzahn: "should the new jobs still have an $ensure to make sure they don't run in both DCs?" [puppet] - https://gerrit.wikimedia.org/r/710520 (https://phabricator.wikimedia.org/T288175) (owner: Ladsgroup)
[09:17:17] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1456.eqiad.wmnet
[09:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:27] SRE, ops-eqiad, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn) a:Dzahn→Jelto Jelto, over to you, since you are removing the last 4 servers :) Just c...
[09:18:45] SRE, ops-eqiad, DC-Ops, serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (Dzahn) Thanks @Cmjohnson . just had to reimage mw1456 and now everything is done and in production (active in netbox)
[09:19:29] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[09:20:45] !log mvernon@cumin1001 dbctl commit (dc=all): 'db2121 depooling: reimage to buster T288244', diff saved to https://phabricator.wikimedia.org/P17033 and previous config saved to /var/cache/conftool/dbconfig/20210817-092045-mvernon.json
[09:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:07] SRE, ops-eqiad, DC-Ops, serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (Dzahn)
[09:21:39] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn) Open→Resolved done !:) All new servers are finally in production now.
[09:24:44] SRE, MW-on-K8s, serviceops: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (JMeybohm) I'd assume that MW makes HTTP calls to the public endpoints of MW. Those will be blocked in k8s as we generally prohibit egress traffic. I'm not sure this is the righ...
[09:25:34] (PS1) MMandere: varnish: Containerize varnish test environment [puppet] - https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639)
[09:37:58] SRE, DC-Ops, Infrastructure-Foundations, netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (cmooney) @robh no problem I will document the process there once it is fully clear to me. Update for now is that RIPE have gotten back to me a...
[09:50:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2121.codfw.wmnet with reason: REIMAGE
[09:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:38] (CR) Effie Mouzeli: [C: +2] hieradata: disable ssl listening on mcrouter except on mwmaint2002 [puppet] - https://gerrit.wikimedia.org/r/712920 (https://phabricator.wikimedia.org/T288787) (owner: Effie Mouzeli)
[09:52:57] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2121.codfw.wmnet with reason: REIMAGE
[09:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:49] !log enable puppet on mediawiki hosts
[10:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:13] (CR) Hnowlan: [C: +2] conftool: remove old maps hosts before decom [puppet] - https://gerrit.wikimedia.org/r/712932 (https://phabricator.wikimedia.org/T288810) (owner: Hnowlan)
[10:15:21] (PS2) Hnowlan: conftool: remove old maps hosts before decom [puppet] - https://gerrit.wikimedia.org/r/712932 (https://phabricator.wikimedia.org/T288810)
[10:17:07] (PS1) Phuedx: tallyElectionJob: Catch and log exceptions [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/713361 (https://phabricator.wikimedia.org/T288361)
[10:20:35] (PS1) MVernon: Revert "db2121: Disable notifications for buster reimage" [puppet] - https://gerrit.wikimedia.org/r/713362
[10:21:49] (CR) MVernon: [V: +2 C: +2] Revert "db2121: Disable notifications for buster reimage" [puppet] - https://gerrit.wikimedia.org/r/713362 (owner: MVernon)
[10:22:01] jayme , effie : I'm struggling with what I think are connectivity issues with flink streaming updater on eqiad/codfw (staging works) - it seems that my Job Manager can't connect to task managers
[10:23:20] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/rdf-streaming-updater/ - helmfile with values
[10:23:45] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/flink-session-cluster/ - chart
[10:24:25] I'm not super familiar how to set up networking correctly, but it works on staging, so I assumed it should be ok on production as well, but that doesn't seem to be the case
[10:31:19] !log mvernon@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 25%: buster reimage T288244', diff saved to https://phabricator.wikimedia.org/P17034 and previous config saved to /var/cache/conftool/dbconfig/20210817-103118-mvernon.json
[10:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:28] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244
[10:34:27] SRE, ops-eqdfw, Infrastructure-Foundations, netops: cr2-eqdfw: PEM 1 Not Powered - https://phabricator.wikimedia.org/T289028 (ayounsi) Open→Resolved Nevermind. It was an Equinix maintenance, it's now back to normal.
[10:46:22] !log mvernon@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 50%: buster reimage T288244', diff saved to https://phabricator.wikimedia.org/P17035 and previous config saved to /var/cache/conftool/dbconfig/20210817-104622-mvernon.json [10:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:31] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244 [10:54:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (10jcrespo) I am extending the downtime for an extra week. I just noticed backup1006.mgmt no longer getting ping since yesterday, di... [10:57:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (10Cmjohnson) I’m sorry, I took it down to work on the server and was sidetracked with immediate network repairs and forgot to get ba... [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. Your patch may or may not be deployed at the sole discretion of the deployer. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T1100). [11:00:05] phuedx: A patch you scheduled for European mid-day backport window (your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:18] o/ [11:00:22] o/ [11:00:24] o/ [11:00:36] phuedx: do you want to self-service, or should I deploy? 
[11:01:14] urbanecm: I can deploy :) [11:01:21] okay, then go for it :) [11:01:26] !log mvernon@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 75%: buster reimage T288244', diff saved to https://phabricator.wikimedia.org/P17037 and previous config saved to /var/cache/conftool/dbconfig/20210817-110125-mvernon.json [11:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:35] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244 [11:01:44] (03CR) 10Phuedx: [C: 03+2] "BACKPORT!!1" [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713361 (https://phabricator.wikimedia.org/T288361) (owner: 10Phuedx) [11:02:53] such excitement [11:03:58] It'll pass. Given time [11:04:24] * Lucas_WMDE mopes glumly [11:06:31] (03Merged) 10jenkins-bot: tallyElectionJob: Catch and log exceptions [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713361 (https://phabricator.wikimedia.org/T288361) (owner: 10Phuedx) [11:06:42] that...didn't take long [11:07:26] given that the extension has apparently been having problems for weeks now, I’m not overly surprised if it doesn’t have many tests to run in CI… [11:07:52] it still has to run all the gated stuff though :) [11:09:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:05] Pulling onto mwdebug1002 [11:12:49] Yeah, SecurePoll has only 8% coverage. https://doc.wikimedia.org/cover-extensions/SecurePoll/ looks quite unhealthy. [11:13:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:02] (03PS1) 10Hnowlan: maps: move configuration overrides to main configuration [puppet] - 10https://gerrit.wikimedia.org/r/713451 (https://phabricator.wikimedia.org/T288810) [11:14:08] Testing testing [11:14:12] zabe: still better than CentralAuth :P [11:15:06] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [11:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:17] sadly thats true :/ [11:16:30] !log mvernon@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 100%: buster reimage T288244', diff saved to https://phabricator.wikimedia.org/P17038 and previous config saved to /var/cache/conftool/dbconfig/20210817-111629-mvernon.json [11:16:31] (03CR) 10Hnowlan: "This change also removes the tilerator_ncpu_ratio and moves the sync to twice daily rather than once daily (the old default)." [puppet] - 10https://gerrit.wikimedia.org/r/713451 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [11:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:37] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244 [11:17:27] jumped up a bit with the new database access service for which I added full coverage, but still only 4% for includes/ [11:20:31] Come to think of it: Can changes to jobs be tested on mwdebug? The job runner is elsewhere, right? [11:22:53] ^ urbanecm? 
[11:23:22] phuedx: you can test job get submitted, but you can't test the job itself [11:23:44] (we don't have a jobqueue for test jobs) [11:25:14] Interestingly, votewiki is acting as if it's read-only on mwdebug1002 [11:25:35] phuedx: that makes sense -- you need mwdebug2xxx [11:25:49] mwdebug1002 is in eqiad and talks to eqiad primary DB [11:25:52] (which _is_ read only) [11:25:53] I've just pulled to mwdebug2001 as I said that [11:25:59] Hah [11:26:09] we're not active/active...yet :) [11:26:22] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:26] * phuedx obviously needs more coffee [11:27:49] Alright. Job is being enqueued as expected. I'll sync and then test end-to-end [11:29:23] sounds good! [11:29:42] !log phuedx@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/SecurePoll/includes/Jobs/TallyElectionJob.php: Backport: [[gerrit:713361|tallyElectionJob: Catch and log exceptions (T288361)]] (duration: 00m 58s) [11:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:50] T288361: SecurePoll: Tallying an encrypted election on votewiki produces no results - https://phabricator.wikimedia.org/T288361 [11:36:59] (03CR) 10Ema: "Fantastic work! Here's a first round of comments." [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [11:40:43] The job is executing as expected. I'm waiting to be added as an admin to a test encrypted poll so that I can test the tallying of encrypted elections end-to-end [11:41:22] you can probably add yourself using createAndPromote.php [11:41:55] Reedy: I'm definitely in the appropriate groups on votewiki but you have to be added as an admin on a per-poll basis [11:42:36] It's a row in the database but I'm reluctant to edit it ;) [11:42:42] Checking cli scripts now [11:42:50] serialized php? 
[11:43:01] (or json, maybe, if we're lucky) [11:43:21] Shouldn't be. I'm just reluctant to edit the prod DB if I can avoid it [11:44:46] which is it? [11:46:14] votewiki [11:47:44] i meant table/column ;P [11:48:40] :P [11:48:49] Reedy: I'm pretty sure the data is all the way down in externalstorage [11:49:09] as securepoll configures polls via wiki pages [11:49:19] (in theory you could edit it via edit.php) [11:49:43] It's the securepoll_properties table in this case [11:49:56] The list of admins is a pipe-separated list of usernames [11:50:15] interesting. wouldn't guess it judging by pages like https://vote.wikimedia.org/wiki/SecurePoll:1138 [11:51:01] phuedx: that poll lists you as an admin, and says "encrypt-type": "gpg", btw [11:51:39] urbanecm: Unfortunately, the poll hasn't finished yet so I can't tally it [11:51:52] maybe you can edit the ending date though? :D [11:52:40] I'll be writing a task for a CLI script after this [11:53:05] I'll update the row in the DB to add me as an admin to a test encrypted election [11:56:54] oh, duh [11:57:08] I forgot securepoll wasn't fully moved to abstract schema [11:57:13] so not everything is in the root sql file [11:57:26] (03PS1) 10Ladsgroup: wdqs: Avoid add trailing slash for querybuilder [puppet] - 10https://gerrit.wikimedia.org/r/713454 (https://phabricator.wikimedia.org/T266703) [11:58:20] Reedy: I've seen the patches. They're from a little while ago. Happy to help land them [11:59:01] I don't think they exist yet for the other tables [11:59:07] oh, wait, they do [11:59:25] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/657196 conflicts massively though [11:59:46] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, and 4 others: Deploy query builder to microsites (on top of the wdqs-ui) - https://phabricator.wikimedia.org/T266703 (10Ladsgroup) [11:59:58] * Reedy might have a look and rebase after lunch [11:59:58] I see an error reported. Rad! [12:00:25] Alright. 
Sorry for the delay, all [12:05:29] urbanecm: Also TIL about the SecurePoll namespace. Thanks [12:05:43] any time! [12:10:33] (03PS1) 10Jgiannelos: tegola-vector-tiles: Reduce staging replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/713458 [12:14:27] (03CR) 10Jgiannelos: "Reducing replicas because we maxed out the connection to postgres to the single read replica we use and deployments don't work when rollin" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713458 (owner: 10Jgiannelos) [12:25:10] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) RIPE have emailed back to confirm registration and provided image download ` From: RIPE Atlas (no reply) [mailto:no-reply@ripe.net] S... [12:25:32] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) [12:33:26] (03PS1) 10Vgutierrez: envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) [12:39:11] (03CR) 10Vgutierrez: "NOOP at envoy level on mw-api and mw clusters: https://puppet-compiler.wmflabs.org/compiler1002/30597/" [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [12:46:20] (03PS2) 10Vgutierrez: envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) [12:48:55] (03CR) 10Ema: varnish: Containerize varnish test environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [12:53:20] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Reduce staging replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/713458 (owner: 10Jgiannelos) 
[12:55:30] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [12:55:31] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [12:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:42] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [12:55:42] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [12:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:11] (03Merged) 10jenkins-bot: tegola-vector-tiles: Reduce staging replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/713458 (owner: 10Jgiannelos) [13:00:15] 10SRE, 10ops-codfw, 10DC-Ops: codfw: Netbox Error - https://phabricator.wikimedia.org/T288586 (10Papaul) @willy I am using a personal Cisco roll over cable that connects the device to the console server . This was done so that the device can be configured. once the configuration done I am planning on removin... [13:06:03] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [13:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:51] (03PS1) 10Kormat: ProductionServices: Promote pc2012 to primary of pc2. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/713466 (https://phabricator.wikimedia.org/T284825) [13:31:16] (03PS1) 10Jgiannelos: tegola-vector-tiles: Reduce staging max_connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/713467 (https://phabricator.wikimedia.org/T283159) [13:33:21] (03PS2) 10Jgiannelos: tegola-vector-tiles: Reduce staging max_connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/713467 (https://phabricator.wikimedia.org/T283159) [13:39:38] (03PS1) 10MVernon: db1181: reflect that this is a candidate master [puppet] - 10https://gerrit.wikimedia.org/r/713469 [13:40:34] (03CR) 10Kormat: [C: 03+1] db1181: reflect that this is a candidate master [puppet] - 10https://gerrit.wikimedia.org/r/713469 (owner: 10MVernon) [13:41:42] (03CR) 10MVernon: [C: 03+2] db1181: reflect that this is a candidate master [puppet] - 10https://gerrit.wikimedia.org/r/713469 (owner: 10MVernon) [13:42:28] (03PS1) 10Kormat: ProductionServices: Promote pc2013 to primary of pc3. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/713471 (https://phabricator.wikimedia.org/T284825) [13:44:42] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Reduce staging max_connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/713467 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [13:47:05] (03Merged) 10jenkins-bot: tegola-vector-tiles: Reduce staging max_connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/713467 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [13:47:46] (03PS2) 10Jcrespo: mediabackups: Add mysql grants for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/712993 (https://phabricator.wikimedia.org/T276442) [13:47:48] (03PS1) 10Jcrespo: backup: Update minio storage location to be /srv/objectstorage [puppet] - 10https://gerrit.wikimedia.org/r/713473 (https://phabricator.wikimedia.org/T276442) [13:48:23] (03PS2) 10Jcrespo: backup: Update minio storage location to be /srv/objectstorage [puppet] - 10https://gerrit.wikimedia.org/r/713473 (https://phabricator.wikimedia.org/T276442) [13:50:08] (03CR) 10Jcrespo: [C: 03+2] backup: Update minio storage location to be /srv/objectstorage [puppet] - 10https://gerrit.wikimedia.org/r/713473 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [13:50:15] (03PS3) 10Jcrespo: backup: Update minio storage location to be /srv/objectstorage [puppet] - 10https://gerrit.wikimedia.org/r/713473 (https://phabricator.wikimedia.org/T276442) [13:51:21] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [13:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:21] !log rolling restart of minio on backup server [13:53:28] (03CR) 10Kormat: "This will be submitted some time after the pc2 one." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/713471 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [13:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:51] (03PS1) 10Andrew Bogott: nova-fullstack: add checking of dns ptr records [puppet] - 10https://gerrit.wikimedia.org/r/713475 (https://phabricator.wikimedia.org/T288854) [13:59:53] (03PS2) 10Andrew Bogott: nova-fullstack: add checking of dns ptr records [puppet] - 10https://gerrit.wikimedia.org/r/713475 (https://phabricator.wikimedia.org/T288854) [14:09:10] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30599/console" [puppet] - 10https://gerrit.wikimedia.org/r/713451 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [14:14:32] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:15:16] jouncebot: now [14:15:16] No deployments scheduled for the next 1 hour(s) and 44 minute(s) [14:15:33] (03CR) 10Kormat: [C: 03+2] ProductionServices: Promote pc2012 to primary of pc2. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713466 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [14:16:26] (03Merged) 10jenkins-bot: ProductionServices: Promote pc2012 to primary of pc2. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/713466 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [14:20:38] !log kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc2012 to primary of pc2 T284825 (duration: 00m 59s) [14:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:48] T284825: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 [14:21:41] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [14:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:52] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: add checking of dns ptr records [puppet] - 10https://gerrit.wikimedia.org/r/713475 (https://phabricator.wikimedia.org/T288854) (owner: 10Andrew Bogott) [14:25:06] !log running a full testwiki media backup on a single thread, single worker T262668 [14:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:14] T262668: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 [14:25:20] ^ godog [14:25:30] (03PS2) 10Hnowlan: restbase: set lower check_disk thresholds for instance-data volume [puppet] - 10https://gerrit.wikimedia.org/r/711135 (https://phabricator.wikimedia.org/T191659) [14:26:35] ^ and I guess a heads up to Emperor will be also be nice :-) [14:27:29] jynus: ack, thanks for the heads up [14:27:56] (03CR) 10Marostegui: [C: 03+1] ProductionServices: Promote pc2013 to primary of pc3. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/713471 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [14:27:57] I have to do it really bad if I am breaking things already :-D [14:28:01] (03PS2) 10Ladsgroup: wdqs: Avoid add trailing slash for querybuilder [puppet] - 10https://gerrit.wikimedia.org/r/713454 (https://phabricator.wikimedia.org/T266703) [14:28:14] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/713454 (https://phabricator.wikimedia.org/T266703) (owner: 10Ladsgroup) [14:29:14] * Emperor twitches [14:31:02] (03CR) 10Marostegui: [C: 03+1] mediabackups: Add mysql grants for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/712993 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [14:31:54] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 33 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:32:02] (03CR) 10Ladsgroup: "PCC https://puppet-compiler.wmflabs.org/compiler1001/884/miscweb1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/713454 (https://phabricator.wikimedia.org/T266703) (owner: 10Ladsgroup) [14:35:40] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Cookbooks repository: avoid stale code in master branch - https://phabricator.wikimedia.org/T287465 (10joanna_borun) Thank you all for such an active conversation. Myself and Nicholas would like to follow up on this thread and sum up... [14:35:54] (03CR) 10Kormat: [C: 03+2] ProductionServices: Promote pc2013 to primary of pc3. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713471 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [14:36:35] (03Merged) 10jenkins-bot: ProductionServices: Promote pc2013 to primary of pc3. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/713471 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [14:37:58] !log kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc2013 to primary of pc3 T284825 (duration: 00m 58s) [14:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:06] T284825: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 [14:38:53] (03PS1) 10Andrew Bogott: nova-fullstack: rearrange dig -x args order [puppet] - 10https://gerrit.wikimedia.org/r/713478 (https://phabricator.wikimedia.org/T288854) [14:39:51] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: rearrange dig -x args order [puppet] - 10https://gerrit.wikimedia.org/r/713478 (https://phabricator.wikimedia.org/T288854) (owner: 10Andrew Bogott) [14:40:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:32] jynus: are the reads in eqiad ? if so feel free to crank it up too [14:43:55] nope, codfw, I don't have yet all servers on eqiad [14:44:14] ack, got it [14:49:25] (03PS1) 10Andrew Bogott: nova-fullstack: strip terminal '.' from dig -x response before comparing [puppet] - 10https://gerrit.wikimedia.org/r/713479 (https://phabricator.wikimedia.org/T288854) [14:50:45] (03PS2) 10Andrew Bogott: nova-fullstack: strip terminal '.' 
from dig -x response before comparing [puppet] - 10https://gerrit.wikimedia.org/r/713479 (https://phabricator.wikimedia.org/T288854) [14:51:32] 10SRE, 10Proton, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10MSantos) @Jgiannelos the board looks good for me. I think we should change the primary one with these changes. [14:52:17] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: strip terminal '.' from dig -x response before comparing [puppet] - 10https://gerrit.wikimedia.org/r/713479 (https://phabricator.wikimedia.org/T288854) (owner: 10Andrew Bogott) [14:53:46] PROBLEM - Check no envoy runtime configuration is left persistent on thanos-fe1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 434 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:55:39] ooof, that's me, will silence [15:01:50] (03PS1) 10Vgutierrez: varnish: Handle UDS traffic properly [puppet] - 10https://gerrit.wikimedia.org/r/713482 [15:05:25] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [15:06:06] (03CR) 10Ladsgroup: mediawiki: Migrate wikidatawiki dispatch crons to three systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710520 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [15:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:52] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [15:10:00] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:14:47] (03CR) 10Ahmon Dancy: [C: 03+1] httpbb: Remove the failing K.A.Z test pending investigation. 
[puppet] - 10https://gerrit.wikimedia.org/r/713375 (https://phabricator.wikimedia.org/T289022) (owner: 10RLazarus) [15:23:03] 10SRE, 10ops-eqiad: Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T289069 (10ops-monitoring-bot) [15:23:07] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [15:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:26] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10Epic, 10Goal: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) ` root@db2151.codfw.wmnet[mediabackups]> select backup_status, backup_status_name as `backup_status`, count(*) as `#` FROM files JO... [15:32:32] !log zpapierski@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [15:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:34:28] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10Epic, 10Goal: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) Storage view: {F34598388} {F34598387} Now time to check logs, errors to see if we need some fixes or the errors come from the or... [15:36:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:36:44] (03CR) 10Daimona Eaytoy: [C: 03+1] Revert "Hide disambiguator-link-added tag temporarily" [extensions/Disambiguator] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713355 (owner: 10MusikAnimal) [15:36:52] (03CR) 10Daimona Eaytoy: [C: 03+1] Apply the disambiguator-link-added tag in onRecentChange_save [extensions/Disambiguator] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713357 (https://phabricator.wikimedia.org/T287549) (owner: 10MusikAnimal) [15:36:56] (03CR) 10MusikAnimal: [C: 03+2] Revert "Hide disambiguator-link-added tag temporarily" [extensions/Disambiguator] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713355 (owner: 10MusikAnimal) [15:37:02] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [15:37:07] (03CR) 10MusikAnimal: [C: 03+2] Apply the disambiguator-link-added tag in onRecentChange_save [extensions/Disambiguator] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713357 (https://phabricator.wikimedia.org/T287549) (owner: 10MusikAnimal) [15:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:58] (03PS1) 10David Caro: wmcs.node_pinger: Don't fail when ping fails, return -1 [puppet] - 10https://gerrit.wikimedia.org/r/713483 [15:43:57] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.node_pinger: Don't fail when ping fails, return -1 [puppet] - 10https://gerrit.wikimedia.org/r/713483 (owner: 10David Caro) [15:44:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:44:21] (03CR) 10David Caro: [C: 03+2] wmcs.node_pinger: Don't fail when ping fails, return -1 [puppet] - 10https://gerrit.wikimedia.org/r/713483 (owner: 10David Caro) [15:44:54] 10SRE, 10ops-eqiad: Degraded RAID on 
cloudcephosd1008 - https://phabricator.wikimedia.org/T289073 (10ops-monitoring-bot) [15:45:51] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 28 Oct 2021 09:00:44 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:47:22] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10Epic, 10Goal: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) Files are more or less equally distributed among storage hosts: ` root@db2151.codfw.wmnet[mediabackups]> select location, endpoint... [15:47:50] jouncebot: now [15:47:51] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [15:47:53] jouncebot: next [15:47:54] In 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T1600) [15:48:10] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10Cmjohnson) The issue was the raid controller took the disk and it was raid-capable. I booted into the raid-bios, changed the disk to non-raid, and power cycled. Back at the OS,... [15:50:20] (03CR) 10Effie Mouzeli: [C: 03+2] wdqs: Avoid add trailing slash for querybuilder [puppet] - 10https://gerrit.wikimedia.org/r/713454 (https://phabricator.wikimedia.org/T266703) (owner: 10Ladsgroup) [15:50:23] 10SRE, 10SRE-swift-storage: Cannot delete "File:The Chorisettes tmp.jpg" on Commons: Error deleting file: An unknown error occurred in storage backend "local-multiwrite" - https://phabricator.wikimedia.org/T288968 (10Billinghurst) [15:50:38] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". 
" - https://phabricator.wikimedia.org/T244567 (10Billinghurst) [15:51:48] * urbanecm is deploying a secpatch [15:55:33] !log Deploy a security patch for T289064 [15:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:35] RECOVERY - Host backup1006 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [15:57:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (10Cmjohnson) I pulled the system out of the rack, checked the seating for all connections. Cleared the log and powercycled. Let's s... [15:57:35] (03PS1) 10ZPapierski: Revert "Test - verify task manager count impact" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713485 [15:57:49] RECOVERY - Host backup1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [15:58:00] 10SRE-swift-storage, 10Maps, 10serviceops: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (10Jgiannelos) [15:58:07] (03PS2) 10Vgutierrez: varnish: Handle UDS traffic properly [puppet] - 10https://gerrit.wikimedia.org/r/713482 [15:58:16] (03Merged) 10jenkins-bot: Revert "Hide disambiguator-link-added tag temporarily" [extensions/Disambiguator] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713355 (owner: 10MusikAnimal) [15:58:18] (03Merged) 10jenkins-bot: Apply the disambiguator-link-added tag in onRecentChange_save [extensions/Disambiguator] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713357 (https://phabricator.wikimedia.org/T287549) (owner: 10MusikAnimal) [15:58:39] 10SRE, 10ops-eqiad: Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T289073 (10Cmjohnson) 05Open→03Declined duplicate [15:59:22] 10SRE, 10ops-eqiad: Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T289069 (10Cmjohnson) 05Open→03Declined duplicate https://phabricator.wikimedia.org/T287838 [16:00:04] jbond and rzl: 
#bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T1600). [16:02:02] (03PS1) 10Bstorm: wikireplicas: add new manual config logic further down as well [puppet] - 10https://gerrit.wikimedia.org/r/713486 (https://phabricator.wikimedia.org/T287442) [16:02:21] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "Test - verify task manager count impact" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713485 (owner: 10ZPapierski) [16:02:41] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10Cmjohnson) 05Open→03Resolved Removed the serial console port (port 9) for old asw-c1.resolving [16:03:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:44] (03Merged) 10jenkins-bot: Revert "Test - verify task manager count impact" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713485 (owner: 10ZPapierski) [16:05:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:11] !log zpapierski@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[16:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:37] (03PS1) 10Ladsgroup: Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff" [core] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713365 (https://phabricator.wikimedia.org/T288998) [16:07:56] (03PS1) 10Ladsgroup: Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff" [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713506 (https://phabricator.wikimedia.org/T288998) [16:08:33] !log zpapierski@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [16:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:45] (03CR) 10Bstorm: [C: 03+2] wikireplicas: add new manual config logic further down as well [puppet] - 10https://gerrit.wikimedia.org/r/713486 (https://phabricator.wikimedia.org/T287442) (owner: 10Bstorm) [16:10:30] 10SRE, 10serviceops: mcrouter crashing on mwmaint2002 - https://phabricator.wikimedia.org/T288787 (10jijiki) 05Open→03Resolved https://gerrit.wikimedia.org/r/712920 is merged, closing this. [16:11:05] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Switch failure: asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (10Cmjohnson) @ayounsi Did the fiber swap fix the issue or do we need to change optics? 
[16:12:11] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10Epic, 10Goal: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) This is the view from the filesystem and with the mc command line client: {F34598421} {F34598420} [16:13:21] !log 1.37.0-wmf.19 train: running scap prep, branched at 79c9b9e61350b0edd1acccb5e717875ba64cf9c1 [16:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:38] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10jijiki) @Papaul I can put the server back in production, but when the server was active, we didn't get any kind of logs as to which CPU is problematic, if it is not both. I will put it back in prod tomorrow my morning, and... [16:15:12] 10SRE, 10Security, 10User-razzi: Cookbook to reboot cassandra nodes - https://phabricator.wikimedia.org/T288975 (10razzi) [16:16:23] (03PS3) 10Hnowlan: restbase: set lower check_disk thresholds for instance-data volume [puppet] - 10https://gerrit.wikimedia.org/r/711135 (https://phabricator.wikimedia.org/T191659) [16:16:39] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713487 [16:16:41] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713487 (owner: 10Brennen Bearnes) [16:17:47] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713487 (owner: 10Brennen Bearnes) [16:17:51] !log brennen@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.19 [16:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:37] 10SRE-swift-storage, 10Maps, 10serviceops: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (10Jgiannelos) According to grafana, swift connections are failing since Aug 13 
https://grafana.wikimedia.org/goto/FjxPVpn7k [16:19:25] (03PS1) 10Ahmon Dancy: httpbb: Add check for https://en.wikipedia.org/favicon.ico [puppet] - 10https://gerrit.wikimedia.org/r/713488 (https://phabricator.wikimedia.org/T285232) [16:19:44] jouncebot: now [16:19:45] For the next 0 hour(s) and 40 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T1600) [16:19:49] jouncebot: next [16:19:49] In 0 hour(s) and 40 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T1700) [16:20:06] okay, deploying the backport [16:20:19] (03CR) 10Ladsgroup: [C: 03+2] Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff" [core] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713365 (https://phabricator.wikimedia.org/T288998) (owner: 10Ladsgroup) [16:20:22] (03CR) 10Ladsgroup: [C: 03+2] Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff" [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713506 (https://phabricator.wikimedia.org/T288998) (owner: 10Ladsgroup) [16:20:56] Amir1: brennen currently runs full scap, I'm not sure it's the best time for a deployment. [16:21:26] hmm, it's puppet window atm [16:21:44] I know -- see the !_log a few lines above though [16:21:45] how long is left? [16:21:53] Amir1: apologies, i didn't see anything in deployment schedule [16:21:59] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30603/console" [puppet] - 10https://gerrit.wikimedia.org/r/711135 (https://phabricator.wikimedia.org/T191659) (owner: 10Hnowlan) [16:22:13] this... will take a while. it's initial sync to testwikis. [16:22:16] brennen: my fault, I thought the window is empty :D [16:22:54] so I stop the merge now, let me know once it's mostly done (jenkins take twenty minutes) [16:23:04] Amir1: will ping. 
[16:23:09] (03CR) 10Ladsgroup: [C: 03+1] "Waiting for a bit" [core] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713365 (https://phabricator.wikimedia.org/T288998) (owner: 10Ladsgroup) [16:23:56] (03CR) 10Ladsgroup: [C: 03+1] "will wait for deployment atm to finish" [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713506 (https://phabricator.wikimedia.org/T288998) (owner: 10Ladsgroup) [16:25:43] !log dns3001: upgrade gdnsd package to 3.8.0-1~wmf1 [16:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:25] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' topicsubscription as beta feature on phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713491 (https://phabricator.wikimedia.org/T287800) [16:30:53] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) @kostajh @krinkle, I would like to move this task forward. My plan is to remove one or two redis shard(s) per day, until we have 8 left. Right now the size of... [16:36:08] 10SRE, 10ops-eqiad, 10DC-Ops: document all scs connections - https://phabricator.wikimedia.org/T175876 (10Cmjohnson) scs-c1 cable labels updated [16:36:19] (03PS1) 10Effie Mouzeli: hieradata: remove extra key from mediawiki/memcached.yaml [puppet] - 10https://gerrit.wikimedia.org/r/713492 [16:43:30] 10SRE, 10LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (10NRodriguez) Thank you so much! 
[16:44:20] (03PS1) 10Jgiannelos: tegola-vector-tiles: Fix the location for swift s3api [deployment-charts] - 10https://gerrit.wikimedia.org/r/713493 (https://phabricator.wikimedia.org/T289076) [16:44:58] (03PS1) 10Effie Mouzeli: site: Install memcached on new memcached servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/713494 (https://phabricator.wikimedia.org/T278225) [16:47:17] 10SRE-swift-storage, 10Maps, 10serviceops, 10Patch-For-Review: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (10Jgiannelos) I guess its relevant to the bullseye upgrade: buster version: https://github.com/openstack/swift/blob/2.19.1/swift/common/middleware/s3api/s3api... [16:48:09] (03PS1) 10Bstorm: wikireplicas: remove old code for supporting monolithic replicas [puppet] - 10https://gerrit.wikimedia.org/r/713495 [16:49:07] (03PS1) 10MSantos: mobileapps: bump to 2021-08-12-112530-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/713496 [16:50:36] (03CR) 10Eevans: [C: 03+1] "Nice; I like that this will generate a more specific alert (complete with link to more info)!" 
[puppet] - 10https://gerrit.wikimedia.org/r/711135 (https://phabricator.wikimedia.org/T191659) (owner: 10Hnowlan) [16:50:55] !log dns1001: upgrade gdnsd package to 3.8.0-1~wmf1 [16:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:11] (03CR) 10Bstorm: "To reviewers: Besides checking that my f-strings are the same as the format() versions, the config["instances"] == "all" bit should not ex" [puppet] - 10https://gerrit.wikimedia.org/r/713495 (owner: 10Bstorm) [16:54:24] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, and 4 others: Deploy query builder to microsites (on top of the wdqs-ui) - https://phabricator.wikimedia.org/T266703 (10Ladsgroup) [16:56:16] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.19 (duration: 38m 24s) [16:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:26] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-08-12-112530-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/713496 (owner: 10MSantos) [17:00:05] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T1700). [17:00:09] Amir1: testwiki sync finished a bit faster than expected. 
[17:00:15] (03Merged) 10jenkins-bot: mobileapps: bump to 2021-08-12-112530-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/713496 (owner: 10MSantos) [17:00:22] magic :D [17:00:50] (03CR) 10Ladsgroup: [C: 03+2] "deploying now" [core] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713365 (https://phabricator.wikimedia.org/T288998) (owner: 10Ladsgroup) [17:00:55] (03CR) 10Ladsgroup: [C: 03+2] Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff" [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713506 (https://phabricator.wikimedia.org/T288998) (owner: 10Ladsgroup) [17:02:42] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [17:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:24] (03PS1) 10Ladsgroup: miscweb: Add query builder to monitoring probes [puppet] - 10https://gerrit.wikimedia.org/r/713497 (https://phabricator.wikimedia.org/T266703) [17:04:27] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:34] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Switch failure: asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (10cmooney) Hey @Cmjohnson. Link has been clean since, but it often goes a few weeks without issues, so we probably want to wait just to be sure. So far so g... [17:07:16] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[17:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:41] PROBLEM - Host cloudcephosd1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:12:37] 10SRE, 10serviceops, 10Patch-For-Review: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10RLazarus) [17:12:43] 10Puppet, 10Infrastructure-Foundations, 10Wikidata, 10wdwb-tech, and 2 others: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (10RLazarus) [17:15:17] !log authdns2001,dns[245]001: upgrade gdnsd package to 3.8.0-1~wmf1 (all authdns upgraded after this) [17:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:31] (03CR) 10RLazarus: [C: 03+2] mediawiki: Migrate wikidatawiki dispatch crons to three systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710520 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [17:18:14] (03Merged) 10jenkins-bot: Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff" [core] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713365 (https://phabricator.wikimedia.org/T288998) (owner: 10Ladsgroup) [17:18:36] (03Merged) 10jenkins-bot: Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff" [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713506 (https://phabricator.wikimedia.org/T288998) (owner: 10Ladsgroup) [17:23:25] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:27:08] (03CR) 10Dzahn: [C: 03+2] miscweb: Add query builder to monitoring probes [puppet] - 10https://gerrit.wikimedia.org/r/713497 (https://phabricator.wikimedia.org/T266703) (owner: 10Ladsgroup) [17:31:23] PROBLEM - Ensure local MW versions match expected deployment on mw2383 is CRITICAL: CRITICAL: 3 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [17:32:58] (03CR) 10RLazarus: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/713488 (https://phabricator.wikimedia.org/T285232) (owner: 10Ahmon Dancy) [17:35:54] (03PS2) 10RLazarus: httpbb: Remove the failing K.A.Z test pending investigation. [puppet] - 10https://gerrit.wikimedia.org/r/713375 (https://phabricator.wikimedia.org/T289022) [17:39:10] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.19/includes/: Backport: [[gerrit:713365|Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff" (T288998)]] (duration: 01m 14s) [17:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:19] T288998: Significant ParserCache space increase after 2021-08-12 (1.37.0-wmf.18 regression) - https://phabricator.wikimedia.org/T288998 [17:41:51] !log [urbanecm@mw2383 ~]$ scap pull # to clear an icinga alert [17:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:14] 10SRE, 10ops-codfw, 10DC-Ops: codfw: Netbox Error - https://phabricator.wikimedia.org/T288586 (10wiki_willy) Got it, thanks for the info @Papaul ! [17:43:15] RECOVERY - Ensure local MW versions match expected deployment on mw2383 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [17:43:43] thanks urbanecm [17:44:00] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.18/includes/: Backport: [[gerrit:713506|Revert "objectcache: make use of new `modtoken` field in SqlBagOStuff" (T288998)]] (duration: 01m 13s) [17:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:15] np [17:49:29] (03PS1) 10Ladsgroup: mediawiki: Drop absented crons in wikidata.pp [puppet] - 10https://gerrit.wikimedia.org/r/713501 (https://phabricator.wikimedia.org/T288175) [17:49:51] (03CR) 10RLazarus: [C: 03+2] httpbb: Remove the failing K.A.Z test pending investigation. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713375 (https://phabricator.wikimedia.org/T289022) (owner: 10RLazarus) [17:52:03] (03CR) 10RLazarus: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/713501 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [18:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T1800) [18:03:29] (03PS1) 10Ladsgroup: mediawiki: Absent logrotate for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/713502 (https://phabricator.wikimedia.org/T288175) [18:03:31] (03PS1) 10Ladsgroup: mediawiki: Drop logrotate config for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/713503 (https://phabricator.wikimedia.org/T288175) [18:06:16] 10Puppet, 10Infrastructure-Foundations, 10Wikidata, 10wdwb-tech, and 2 others: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (10Ladsgroup) Maybe after a while we should delete the old logs, I can put a reminder to delete them in three months. [18:07:52] (03PS2) 10Ladsgroup: mediawiki: Absent logrotate for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/713502 (https://phabricator.wikimedia.org/T288175) [18:07:54] (03PS2) 10Ladsgroup: mediawiki: Drop logrotate config for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/713503 (https://phabricator.wikimedia.org/T288175) [18:08:18] 10SRE-swift-storage, 10envoy, 10serviceops: Envoy and swift HEAD with 204 response turns into 503 - https://phabricator.wikimedia.org/T288815 (10RLazarus) a:03RLazarus [18:10:02] (03PS1) 10RLazarus: envoyproxy: Add $runtime field to set a static runtime layer. 
[puppet] - 10https://gerrit.wikimedia.org/r/713504 (https://phabricator.wikimedia.org/T288815) [18:11:37] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30607/console" [puppet] - 10https://gerrit.wikimedia.org/r/713504 (https://phabricator.wikimedia.org/T288815) (owner: 10RLazarus) [18:17:19] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/713503 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [18:17:28] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/713502 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [18:20:18] (03PS3) 10Ladsgroup: mediawiki: Absent logrotate for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/713502 (https://phabricator.wikimedia.org/T288175) [18:20:20] (03PS3) 10Ladsgroup: mediawiki: Drop logrotate config for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/713503 (https://phabricator.wikimedia.org/T288175) [18:24:13] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:24:48] (03CR) 10RLazarus: "Hugh - are you comfortable reviewing this as an Envoy config change? 
(CCing Joe in case he wants to take a look when he's back from vacati" [puppet] - 10https://gerrit.wikimedia.org/r/713504 (https://phabricator.wikimedia.org/T288815) (owner: 10RLazarus) [18:26:33] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 107 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:30:27] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 47 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:00:04] brennen and jeena: May I have your attention please! MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T1900) [19:01:45] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [19:02:09] !log 1.37.0-wmf.19 train status: no current blockers, proceeding to group0 (T281160) [19:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:20] T281160: 1.37.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T281160 [19:02:23] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:02:48] (03PS1) 10Brennen Bearnes: group0 wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713527 [19:02:50] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713527 (owner: 10Brennen Bearnes) [19:03:42] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713527 (owner: 10Brennen Bearnes) [19:04:23] 10SRE, 10Goal, 10Patch-For-Review: FY2020-2021 Q1 DC switchover and switchback - https://phabricator.wikimedia.org/T243314 (10RLazarus) 
[19:04:51] 10SRE, 10serviceops, 10Patch-For-Review: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10RLazarus) 05Open→03Resolved a:03RLazarus 🎉 With the dispatcher jobs migrated to systemd timers today, this is done! There are no maintenance cronjobs left. ` rzl@mwmaint2002:~$ s... [19:05:38] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.19 [19:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:35] 10SRE, 10SRE-tools, 10Spicerack, 10serviceops, 10Datacenter-Switchover: Clean up cron-specific elements of switchdc cookbooks - https://phabricator.wikimedia.org/T289078 (10RLazarus) [19:18:37] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:20:09] 10SRE, 10MediaWiki-Cache, 10Platform Engineering, 10Wikidata, and 3 others: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (10Krinkle) [19:20:13] 10SRE, 10Datacenter-Switchover: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10RLazarus) [19:20:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 2 others: Clean up cron-specific elements of switchdc cookbooks - https://phabricator.wikimedia.org/T289078 (10RLazarus) [19:20:30] 10SRE, 10MediaWiki-Cache, 10Platform Engineering, 10Wikidata, and 3 others: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (10Krinkle) [19:20:44] 10SRE, 10MediaWiki-Cache, 10Platform Engineering, 10Wikidata, and 3 others: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - 
https://phabricator.wikimedia.org/T286260 (10Krinkle) 05duplicate→03Resolved p:05Triage→03High [19:22:25] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 20 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:36:14] (03PS1) 10Urbanecm: BlockUser: Restore blocking autoblocked IP addresses [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713507 (https://phabricator.wikimedia.org/T287798) [19:36:27] jouncebot: now [19:36:27] For the next 1 hour(s) and 23 minute(s): MediaWiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T1900) [19:37:27] urbanecm: train seems stable currently if you are needing to do a backport. [19:37:37] thanks brennen, that's helpful [19:37:41] (03CR) 10Urbanecm: [C: 03+2] BlockUser: Restore blocking autoblocked IP addresses [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713507 (https://phabricator.wikimedia.org/T287798) (owner: 10Urbanecm) [19:37:57] urbanecm: please ping me when done; i need to delete a couple of old releases. [19:38:43] brennen: I have to wait on CI anyway -- not sure how long deleting old releases usually takes, but if less than CI, feel free to go ahead. [19:39:36] it can take a bit. i'll wait. 
:) [19:39:44] okay, up to you :) [19:39:54] i'll ping you then [19:39:59] (03PS1) 10RLazarus: mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) [19:39:59] thx [19:43:53] (03PS1) 10RLazarus: 08-start-maintenance: Remove cron-specific maintenance implementation details [cookbooks] - 10https://gerrit.wikimedia.org/r/713532 (https://phabricator.wikimedia.org/T289078) [19:45:49] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [19:50:49] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10Cmjohnson) 05Open→03Resolved cmjohnson@cloudcephosd1008:~$ sudo cat /proc/mdstat Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] md0... 
[19:56:56] (03Merged) 10jenkins-bot: BlockUser: Restore blocking autoblocked IP addresses [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713507 (https://phabricator.wikimedia.org/T287798) (owner: 10Urbanecm) [19:57:01] o/ [19:57:47] * urbanecm finding an autoblocked IP to test this on [20:03:42] (03PS2) 10RLazarus: mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) [20:05:45] syncing [20:06:49] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.18/includes/block/BlockUser.php: d377d4fae704640c81172a6fa94b12b2efdba42c: BlockUser: Restore blocking autoblocked IP addresses (T287798) (duration: 01m 08s) [20:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:00] T287798: Unable to block an autoblocked IP address - https://phabricator.wikimedia.org/T287798 [20:08:00] brennen: I'm done. Thanks for letting me use your window :). 
[20:08:55] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [20:11:29] urbanecm: you bet - thanks [20:11:34] jouncebot now [20:11:34] For the next 0 hour(s) and 48 minute(s): MediaWiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T1900) [20:11:38] jouncebot next [20:11:38] In 2 hour(s) and 48 minute(s): Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T2300) [20:11:44] (03PS3) 10RLazarus: mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) [20:14:04] !log pruning 1.37.0-wmf.15 and .16 (T281160) [20:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:12] T281160: 1.37.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T281160 [20:16:52] (03PS1) 10Btullis: Add a dummy SSH key pair for alluxio in the nest cluster [labs/private] - 10https://gerrit.wikimedia.org/r/713535 (https://phabricator.wikimedia.org/T266641) [20:17:29] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [20:17:31] PROBLEM - Ensure local MW versions match expected deployment on mw2383 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:19:40] (03PS4) 10RLazarus: mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) [20:19:51] brennen: 
^^that's the same host i ran scap pull earlier today^^ [20:19:59] wondering why it is out of sync (again) [20:20:12] yeah, odd [20:20:14] !log brennen@deploy1002 Pruned MediaWiki: 1.37.0-wmf.15 (duration: 06m 51s) [20:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:18] urbanecm: T286463 [20:23:18] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add a dummy SSH key pair for alluxio in the nest cluster [labs/private] - 10https://gerrit.wikimedia.org/r/713535 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [20:23:19] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 [20:24:52] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [20:26:52] pruning .16 then will run `scap pull` there. [20:29:01] !log brennen@deploy1002 Pruned MediaWiki: 1.37.0-wmf.16 (duration: 02m 01s) [20:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:26] !log running scap pull on mw2383 [20:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:23] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:35:01] RECOVERY - Ensure local MW versions match expected deployment on mw2383 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:39:03] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 108 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops 
[20:40:33] mw2383 is depooled from puppet, so in my understanding it's not syncing by itself. [20:54:33] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 104 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:02:21] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 23 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:05:36] ^ assuming above is unrelated to train. [21:14:35] zabe: good catch -- it's completely depooled. In that case icinga presumably just should ignore that alert [21:15:53] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10Krinkle) Good to go from both of us. Last time we did maintenance (T252391) it was realized that the instrumentation that relies on the stronger persistence was no lo... [21:17:04] (03CR) 10Herron: "Thanks for this! Please see some initial thoughts/questions inline." 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: 10Ema) [21:17:15] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10Papaul) Thank you [21:38:15] jouncebot: now [21:38:16] No deployments scheduled for the next 1 hour(s) and 21 minute(s) [21:38:23] * urbanecm deploying a security patch [21:39:24] (03PS1) 10Ebernhardson: Hack around i18n cache failure for wikitech [extensions/CirrusSearch] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713508 (https://phabricator.wikimedia.org/T288233) [21:44:06] !log Deploy security patch for T289063 [21:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:27] (03PS1) 10Kosta Harlan: WikimediaEvents: Remove UnderstandingFirstDay config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713553 [21:48:44] * urbanecm done [21:51:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:53] (03PS1) 10Cwhite: openstack: add more fields to nova_fullstack_test logging [puppet] - 10https://gerrit.wikimedia.org/r/713559 [22:29:01] (03PS2) 10Cwhite: openstack: add more fields to nova_fullstack_test logging [puppet] - 10https://gerrit.wikimedia.org/r/713559 [22:39:42] (03CR) 10jerkins-bot: [V: 04-1] Hack around i18n cache failure for wikitech [extensions/CirrusSearch] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713508 (https://phabricator.wikimedia.org/T288233) (owner: 10Ebernhardson) [22:43:59] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2059-production-search-omega-codfw on elastic2059 is CRITICAL: 332.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2059&panelId=37 [22:56:15] (03CR) 10Ebernhardson: "recheck" [extensions/CirrusSearch] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713508 (https://phabricator.wikimedia.org/T288233) (owner: 10Ebernhardson) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening backport window (Your patch may or may not be deployed at the sole discretion of the deployer). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210817T2300). [23:00:04] ebernhardson: A patch you scheduled for Evening backport window (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:01:53] (03PS5) 10RLazarus: mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) [23:05:21] !log resetting email for vanished user [23:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:38] i can ship my patch [23:06:04] (03CR) 10Ebernhardson: [C: 03+2] "test failure was a timeout, unrelated to this patch" [extensions/CirrusSearch] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713508 (https://phabricator.wikimedia.org/T288233) (owner: 10Ebernhardson) [23:07:20] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [23:25:50] (03Merged) 10jenkins-bot: Hack around i18n cache failure for wikitech [extensions/CirrusSearch] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713508 (https://phabricator.wikimedia.org/T288233) (owner: 10Ebernhardson) [23:32:08] !log ebernhardson@deploy1002 Synchronized php-1.37.0-wmf.19/extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php: T288233: Work around cache failure for wikitech (duration: 01m 28s) [23:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:16] T288233: Completion index fails to build on labswiki: MessageCache.php: Process cache for 'en' should be set by now - https://phabricator.wikimedia.org/T288233 [23:32:21] (03PS6) 10RLazarus: mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) [23:33:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:12] all complete with deployment [23:40:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:37] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2059-production-search-omega-codfw on elastic2059 is OK: (C)100 gt (W)80 gt 56.95 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2059&panelId=37