[00:00:05] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210819T0000). [00:20:26] (03PS4) 10RLazarus: envoyproxy: Add $runtime field to set a static runtime layer. [puppet] - 10https://gerrit.wikimedia.org/r/713504 (https://phabricator.wikimedia.org/T288815) [00:21:48] (03CR) 10RLazarus: envoyproxy: Add $runtime field to set a static runtime layer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713504 (https://phabricator.wikimedia.org/T288815) (owner: 10RLazarus) [00:24:42] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30697/console" [puppet] - 10https://gerrit.wikimedia.org/r/713504 (https://phabricator.wikimedia.org/T288815) (owner: 10RLazarus) [00:40:15] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [00:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:49] (03PS1) 10RLazarus: thanos::frontend: Disable Envoy's strict 204 header parsing [puppet] - 10https://gerrit.wikimedia.org/r/713725 (https://phabricator.wikimedia.org/T288815) [00:42:40] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:30] (03CR) 10RLazarus: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/30698/" [puppet] - 10https://gerrit.wikimedia.org/r/713725 (https://phabricator.wikimedia.org/T288815) (owner: 10RLazarus) [00:47:49] 10SRE, 10MW-on-K8s, 10serviceops: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Legoktm) >>! In T288848#7292923, @TK-999 wrote: > For the record, to resolve the same issue during our effort to upgrade Fandom's MW-on-k8s deployment, we ended up creating an... 
[00:52:25] (03PS1) 10Legoktm: [WIP] Re-enable Score with Shellbox on small/medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713726 [00:53:31] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Re-enable Score with Shellbox on small/medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713726 (owner: 10Legoktm) [00:58:31] (03PS2) 10Legoktm: Re-enable Score with Shellbox on most public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713726 (https://phabricator.wikimedia.org/T257066) [00:58:33] (03PS1) 10Legoktm: Drop $wmgUseScoreShellbox, redundant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713727 [01:34:00] PROBLEM - MariaDB Replica Lag: s4 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1215.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:58:23] (03PS3) 10Legoktm: Remove putenv() for GDFONTPATH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664670 (https://phabricator.wikimedia.org/T274822) [02:09:34] (03CR) 10Legoktm: Allow protocol for etcd server to be specified (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [03:13:58] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:15:54] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:10:52] RECOVERY - MariaDB Replica Lag: s4 on db2097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:20:31] !log pool mw2383 - T286463 [04:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:40] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 [04:21:32] RECOVERY - Ensure local MW versions match expected deployment on mw2383 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [04:57:50] RECOVERY - mediawiki-installation DSH group on mw2383 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [05:03:04] (03CR) 10Gergő Tisza: [C: 03+1] "Looks good but needs to wait until next week, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713553 (owner: 10Kosta Harlan) [06:39:58] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [07:17:47] (03PS1) 10JMeybohm: kubernetes/staging: Reorder hiera keys to match production order [puppet] - 10https://gerrit.wikimedia.org/r/713804 (https://phabricator.wikimedia.org/T289131) [07:17:49] (03PS1) 10JMeybohm: kubernetes/staging: Enable Priority admission plugin in codfw [puppet] - 10https://gerrit.wikimedia.org/r/713805 (https://phabricator.wikimedia.org/T289131) [07:17:51] (03PS1) 10JMeybohm: kubernetes/staging: Enable Priority admission plugin in staging [puppet] - 10https://gerrit.wikimedia.org/r/713806 (https://phabricator.wikimedia.org/T289131) [07:17:53] (03PS1) 10JMeybohm: kubernetes: Enable Priority admission plugin [puppet] - 10https://gerrit.wikimedia.org/r/713807 (https://phabricator.wikimedia.org/T289131) [07:19:23] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30699/console" [puppet] - 10https://gerrit.wikimedia.org/r/713807 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [07:20:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30700/console" [puppet] - 10https://gerrit.wikimedia.org/r/713804 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [07:24:41] (03CR) 10Kosta Harlan: [C: 04-2] WikimediaEvents: Remove 
UnderstandingFirstDay config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713553 (owner: 10Kosta Harlan) [07:25:38] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30701/console" [puppet] - 10https://gerrit.wikimedia.org/r/713805 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [07:27:21] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30702/console" [puppet] - 10https://gerrit.wikimedia.org/r/713806 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [07:28:43] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30703/console" [puppet] - 10https://gerrit.wikimedia.org/r/713807 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [07:30:07] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes/staging: Reorder hiera keys to match production order [puppet] - 10https://gerrit.wikimedia.org/r/713804 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [07:30:14] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes/staging: Enable Priority admission plugin in codfw [puppet] - 10https://gerrit.wikimedia.org/r/713805 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [07:30:53] (03CR) 10Filippo Giunchedi: [C: 03+1] aptrepo: add opensearch 1.x component [puppet] - 10https://gerrit.wikimedia.org/r/713701 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [07:34:50] (03CR) 10Razzi: [C: 03+1] "Looks great, nice work Ben. Added 1 question about log4j repetitiveness and a couple tiny whitespace nitpicks. 
Feel free to make minor cha" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [08:19:04] (03CR) 10Filippo Giunchedi: "(note to self) Note the same problem is present on memcached (e.g. as used on swift) on first install. Namely memcached will listen on loc" [puppet] - 10https://gerrit.wikimedia.org/r/705704 (owner: 10Filippo Giunchedi) [08:21:12] (03PS1) 10David Caro: wmcs: enforce a minimum spicerack version [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/713812 [08:21:43] (03CR) 10David Caro: "@andrew this should fix your issues with the current setup." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/713812 (owner: 10David Caro) [08:37:50] (03PS1) 10Filippo Giunchedi: swift: prefix Bullseye pipelines with proxy-logging [puppet] - 10https://gerrit.wikimedia.org/r/713815 (https://phabricator.wikimedia.org/T288815) [08:38:35] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ladsgroup) {meme, src=itshappening} [08:41:50] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [08:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:39] (03CR) 10Filippo Giunchedi: "LGTM, thank you! Swift upstream also came back on the bug report I opened and I got a patch out at If075b4d40. Not the best timing unfort" [puppet] - 10https://gerrit.wikimedia.org/r/713725 (https://phabricator.wikimedia.org/T288815) (owner: 10RLazarus) [08:48:24] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:48:29] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[08:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:53] godog, I am going to slowly ramp up backup speed on eqiad [08:56:07] jynus: *nod* SGTM [08:57:02] the wikis I am backing up now have very small files, so the per-file overhead is very large with just 1 thread [09:00:32] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add service_names [puppet] - 10https://gerrit.wikimedia.org/r/712098 (owner: 10Filippo Giunchedi) [09:00:52] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add sd module [puppet] - 10https://gerrit.wikimedia.org/r/712099 (owner: 10Filippo Giunchedi) [09:01:02] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add lb module [puppet] - 10https://gerrit.wikimedia.org/r/712100 (owner: 10Filippo Giunchedi) [09:09:40] (03CR) 10Hnowlan: [C: 03+1] "This looks okay to me - changing runtime values at startup was new to me, neat." [puppet] - 10https://gerrit.wikimedia.org/r/713504 (https://phabricator.wikimedia.org/T288815) (owner: 10RLazarus) [09:36:24] jynus: FYI I'll be deploying a change to swift frontends and thus do a rolling depool/pool, shouldn't affect your testing though [09:39:34] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: disable ecdhe curve in tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/713610 (https://phabricator.wikimedia.org/T279637) (owner: 10Filippo Giunchedi) [09:39:41] (03PS2) 10Filippo Giunchedi: swift: disable ecdhe curve in tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/713610 (https://phabricator.wikimedia.org/T279637) [09:40:44] godog, if reading files for some reason fails, it just marks that download as failed and it will be retried later [09:41:55] neat, SGTM [09:42:07] that happens naturally anyway, as by the time a get is done, a file can be moved (renamed), deleted, etc.
[09:43:17] ah actually I was mistaken, it is an nginx reload, not a restart, even better [09:43:24] so yeah should be basically hitless afaik [09:45:15] I may ask you what the preferred strategy is, e.g. whether I should try to back up from eqiad as fast as possible before the switch back, etc. [09:48:39] jynus: yes I think testing for speed/concurrency we should do it in eqiad now since it is idle, IMHO [09:49:15] I'm waiting for new ms-be hardware in eqiad, that's happening in a week at least [09:59:19] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1002/30696/" [puppet] - 10https://gerrit.wikimedia.org/r/713655 (https://phabricator.wikimedia.org/T280582) (owner: 10Effie Mouzeli) [10:00:04] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210819T1000). [10:01:48] (03PS1) 10Lucas Werkmeister (WMDE): Update termbox [extensions/Wikibase] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713523 (https://phabricator.wikimedia.org/T286775) [10:02:34] !log roll-reload nginx on ms-fe to apply config change [10:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:04] (03CR) 10Kormat: [C: 03+1] hieradata: remove shard01 from redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/713655 (https://phabricator.wikimedia.org/T280582) (owner: 10Effie Mouzeli) [10:03:26] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Don't set termbox v2 tags yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713824 (https://phabricator.wikimedia.org/T236893) [10:05:43] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: remove shard01 from redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/713655 (https://phabricator.wikimedia.org/T280582) (owner: 10Effie Mouzeli) [10:06:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Add linecard diversity to the router-to-router interconnect in codfw - 
https://phabricator.wikimedia.org/T248506 (10ayounsi) [10:06:45] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) install MPC7E-MRATE FPC into cr[12]-codfw - https://phabricator.wikimedia.org/T277341 (10ayounsi) [10:12:32] !log restart php-fpm on phab1001 [10:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:19] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/712405 (owner: 10PipelineBot) [10:23:10] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/712405 (owner: 10PipelineBot) [10:26:43] (03PS1) 10MSantos: maps: update imposm mapping [puppet] - 10https://gerrit.wikimedia.org/r/713827 (https://phabricator.wikimedia.org/T288400) [10:30:39] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10Aklapper) [10:34:18] PROBLEM - Disk space on urldownloader2002 is CRITICAL: DISK CRITICAL - free space: / 336 MB (3% inode=85%): /tmp 336 MB (3% inode=85%): /var/tmp 336 MB (3% inode=85%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=urldownloader2002&var-datasource=codfw+prometheus/ops [10:36:50] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [10:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:11] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:47] !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[10:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:49] 10SRE, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10Kelson) This seems to be the root cause of: https://el.wikipedia.or... [10:56:00] (03PS1) 10ZPapierski: Prepare a staging test for flink networking issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/713830 [10:56:53] (03PS5) 10Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) [10:56:55] (03PS5) 10Vgutierrez: envoyproxy: Suport TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) [10:56:57] (03PS4) 10Vgutierrez: envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) [10:56:59] (03PS4) 10Vgutierrez: cache: Use envoy lua API to provide TLS info [puppet] - 10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421) [10:57:01] (03PS4) 10Vgutierrez: envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) [11:00:05] Amir1, Lucas_WMDE, and apergos: Time to snap out of that daydream and deploy EU Backport and Config training (Your patch may or may not be deployed at the sole discretion of the deployer). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210819T1100). [11:00:05] Lucas_WMDE: A patch you scheduled for EU Backport and Config training (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:00:17] o/ [11:01:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Update termbox [extensions/Wikibase] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713513 (https://phabricator.wikimedia.org/T286775) (owner: 10Lucas Werkmeister (WMDE)) [11:01:32] this’ll take a while to go through CI anyways (in case anyone isn’t done deploying stuff yet) [11:02:18] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Suport TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:02:26] here [11:02:35] Lucas I see you are the only one with patches in the window [11:02:40] I assume you can self serve [11:02:45] there are no trainees signed up [11:02:53] your patches looked fine :-P [11:03:21] ok \o/ [11:03:51] zuul hasn’t even started gate-and-submit yet 😔 [11:03:54] busy [11:03:55] ugh [11:04:07] come on zuul get with the program [11:04:12] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:04:28] thanks jerkins.. I got it the first time [11:04:35] heh heh [11:04:43] 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-7): Deploy upgraded wikidiff2 with side-locking selection - https://phabricator.wikimedia.org/T285857 (10Daimona) Hi @MoritzMuehlenhoff, CC'ing you per T285856#7294168. Do you need anything else from us to move this forward?
[11:05:25] (03CR) 10Jgiannelos: [C: 03+1] maps: standardise the maps2.0 config in codfw, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [11:06:00] (03PS6) 10Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) [11:06:03] (03PS6) 10Vgutierrez: envoyproxy: Support TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) [11:06:05] (03PS5) 10Vgutierrez: envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) [11:06:09] (03PS5) 10Vgutierrez: cache: Use envoy lua API to provide TLS info [puppet] - 10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421) [11:06:11] (03PS5) 10Vgutierrez: envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) [11:11:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:13:44] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:17:04] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:18:09] by the way, my wmf.19 backport and config change can only really be tested together [11:18:18] so I’ll probably merge both and test them on mwdebug before syncing them [11:18:38] rather than syncing the backport before I know that it works and only merging the config change afterwards [11:18:55] (and then the wmf.18 backport is the same just in case the train gets rolled back) [11:26:49] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Don't set termbox v2 tags yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713824 (https://phabricator.wikimedia.org/T236893) (owner: 10Lucas Werkmeister (WMDE)) [11:27:40] (03Merged) 10jenkins-bot: Revert "Don't set termbox v2 tags yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713824 (https://phabricator.wikimedia.org/T236893) (owner: 10Lucas Werkmeister (WMDE)) [11:27:52] hm, I was hoping the backport would merge first [11:27:55] but it shouldn’t matter [11:28:18] (03Merged) 10jenkins-bot: Update termbox [extensions/Wikibase] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713513 (https://phabricator.wikimedia.org/T286775) (owner: 10Lucas Werkmeister (WMDE)) [11:28:46] hm, but I have another problem [11:28:59] my x-wikimedia-debug extension is still refusing to open the server dropdown [11:29:05] which means I can only test against mwdebug1001, which is read-only [11:29:18] can someone else help me test this? [11:31:16] “Unchecked lastError value: Error: Promised response from onMessage listener went out of scope” in the browser console, I think that’s due to the wikimediadebug issue [11:33:23] lol, I switched to Chromium and there the extension hangs the whole browser [11:34:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:40] Amir1: are you able to set the wikimediadebug extension to mwdebug2001? [11:35:02] it seems so [11:35:11] ok, then please try making a mobile termbox edit there [11:35:22] https://m.wikidata.org/wiki/Q4115189 [11:35:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:09] Lucas_WMDE: errors, did you deploy the backport? [11:36:19] I think I did [11:36:21] oh wait [11:36:23] one second [11:36:50] I don't remember getting emails for that [11:37:00] try again? [11:37:55] yup [11:37:59] (I’d forgotten to `git submodule update view/lib/wikibase-termbox`) [11:38:07] yay [11:38:14] (03CR) 10Effie Mouzeli: [C: 04-1] "Chart version bump missing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713830 (owner: 10ZPapierski) [11:38:15] alright, then I’ll sync termbox first and then the config change [11:38:55] (03PS2) 10ZPapierski: Prepare a staging test for flink networking issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/713830 [11:39:10] !log lucaswerkmeister-wmde@deploy1002 sync-file aborted: Backport: [[gerrit:713513|Update termbox (T236893T286775)]] (duration: 00m 01s) [11:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:42] (03CR) 10ZPapierski: Prepare a staging test for flink networking issues (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/713830 (owner: 10ZPapierski) [11:39:53] (it turned out the task ID in my clipboard had a line break in it and that started the scap a tad too early :<) [11:39:57] (it’s properly syncing now) [11:40:25] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.19/extensions/Wikibase/view/lib/wikibase-termbox/: Backport: [[gerrit:713513|Update termbox (T236893, T286775)]] (duration: 01m 08s) 
[11:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:34] T236893: Tag all edits made via Wikibase View and Wikibase Client - https://phabricator.wikimedia.org/T236893 [11:40:35] T286775: Tag edits made via Termbox v2 - https://phabricator.wikimedia.org/T286775 [11:41:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Update termbox [extensions/Wikibase] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713523 (https://phabricator.wikimedia.org/T286775) (owner: 10Lucas Werkmeister (WMDE)) [11:41:54] sorry about that, I'd wandered off [11:41:59] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:713824|Revert "Don't set termbox v2 tags yet" (T236893, T286775)]] (duration: 01m 06s) [11:42:02] do you still need help testing? [11:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:09] no, should be all good now [11:42:16] 👍 [11:42:19] I’ll just wait for the wmf.18 backport to merge and then sync it directly [11:42:27] (03CR) 10Effie Mouzeli: [C: 03+2] Prepare a staging test for flink networking issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/713830 (owner: 10ZPapierski) [11:42:31] and I should file a task about the debug extension not working, that’s weird [11:42:49] (03PS2) 10Effie Mouzeli: admin: add comment about tillerClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/712307 [11:44:46] (03PS1) 10Effie Mouzeli: hieradata: remove 9 redis shards [puppet] - 10https://gerrit.wikimedia.org/r/713842 (https://phabricator.wikimedia.org/T280582) [11:45:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:24] (03CR) 10Jgiannelos: [C: 03+1] maps: update imposm mapping [puppet] - 10https://gerrit.wikimedia.org/r/713827 (https://phabricator.wikimedia.org/T288400) (owner: 10MSantos) [11:46:13] that's weird about chromium [11:46:36] (03Merged) 10jenkins-bot: Prepare a staging test for flink networking issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/713830 (owner: 10ZPapierski) [11:49:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:54] (03PS2) 10Effie Mouzeli: hieradata: remove 9 redis shards [puppet] - 10https://gerrit.wikimedia.org/r/713842 (https://phabricator.wikimedia.org/T280582) [11:50:14] filed https://phabricator.wikimedia.org/T289246 [11:51:10] (03CR) 10Effie Mouzeli: "PCC is ok https://puppet-compiler.wmflabs.org/compiler1003/30707/" [puppet] - 10https://gerrit.wikimedia.org/r/713842 (https://phabricator.wikimedia.org/T280582) (owner: 10Effie Mouzeli) [11:51:49] 10SRE, 10serviceops: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (10fgiunchedi) Two cents re: metrics/alerting, we have the prometheus pushgateway available which seems like a good fit (more info: https://wikitech.wikimedia.org/wiki/Prometheus#Ephemeral_jobs_(Pushgateway)) [11:53:09] (03PS3) 10Effie Mouzeli: hieradata: remove 9 redis shards [puppet] - 10https://gerrit.wikimedia.org/r/713842 (https://phabricator.wikimedia.org/T280582) [11:55:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:56:06] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:56:44] RECOVERY - Disk space on urldownloader2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=urldownloader2002&var-datasource=codfw+prometheus/ops [11:56:48] !log zpapierski@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [11:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:05] (03PS1) 10Alexandros Kosiaris: url_downloader: Don't cache ICMP database [puppet] - 10https://gerrit.wikimedia.org/r/713843 [12:07:34] (03Merged) 10jenkins-bot: Update termbox [extensions/Wikibase] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713523 (https://phabricator.wikimedia.org/T286775) (owner: 10Lucas Werkmeister (WMDE)) [12:10:17] ^ syncing [12:11:19] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/Wikibase/view/lib/wikibase-termbox/: Backport: [[gerrit:713523|Update termbox (T236893, T286775)]] (duration: 01m 08s) [12:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:29] T236893: Tag all edits made via Wikibase View and Wikibase Client - https://phabricator.wikimedia.org/T236893 [12:11:29] T286775: Tag edits made via Termbox v2 - https://phabricator.wikimedia.org/T286775 [12:11:31] !log EU backport+config window done [12:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:46] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:14] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10Papaul) @jijiki it looks like mw2383 is happy now can we close this task ? Thanks [12:35:28] !log zpapierski@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [12:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:42] (03PS1) 10Kormat: ProductionServices: Promote pc1011 to primary of pc1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713845 (https://phabricator.wikimedia.org/T284825) [12:48:54] (03CR) 10Abijeet Patro: [C: 03+1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/713837 (owner: 10L10n-bot) [12:49:42] (03PS1) 10Kormat: ProductionServices: Promote pc1012 to primary of pc2. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713866 (https://phabricator.wikimedia.org/T284825) [12:52:24] (03PS1) 10Kormat: ProductionServices: Promote pc2013 to primary of pc3. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713867 [12:54:21] (03PS2) 10Kormat: ProductionServices: Promote pc1012 to primary of pc2. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713866 (https://phabricator.wikimedia.org/T284825) [12:54:23] (03PS2) 10Kormat: ProductionServices: Promote pc2013 to primary of pc3. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/713867 (https://phabricator.wikimedia.org/T284825) [12:56:12] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash7-codfw,logstash7-eqiad} instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=th [12:56:12] -cluster=logging-eqiad&var-topic=All&var-consumer_group=All [12:58:46] (03CR) 10Hnowlan: [C: 03+2] Maps: filter out non-administrative boundaries on OSM import [puppet] - 10https://gerrit.wikimedia.org/r/704784 (owner: 10Jgiannelos) [12:59:51] (03PS2) 10Kormat: ProductionServices: Promote pc1011 to primary of pc1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713845 (https://phabricator.wikimedia.org/T284825) [12:59:54] (03PS3) 10Kormat: ProductionServices: Promote pc1012 to primary of pc2. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713866 (https://phabricator.wikimedia.org/T284825) [12:59:55] (03PS3) 10Kormat: ProductionServices: Promote pc2013 to primary of pc3. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713867 (https://phabricator.wikimedia.org/T284825) [13:00:49] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30715/console" [puppet] - 10https://gerrit.wikimedia.org/r/713674 (owner: 10Jgiannelos) [13:01:59] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/713837 (owner: 10L10n-bot) [13:02:12] jouncebot: now [13:02:13] No deployments scheduled for the next 2 hour(s) and 57 minute(s) [13:03:12] (03CR) 10Addshore: [C: 03+1] service: Enable paging for shellbox-constraints service [puppet] - 10https://gerrit.wikimedia.org/r/711737 (owner: 10Legoktm) [13:03:31] (03CR) 10Kormat: [C: 03+2] ProductionServices: Promote pc1011 to primary of pc1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713845 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [13:03:32] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:03:36] (03CR) 10Kormat: [C: 03+2] ProductionServices: Promote pc1012 to primary of pc2. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713866 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [13:04:09] (03PS4) 10Kormat: ProductionServices: Promote pc1013 to primary of pc3. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713867 (https://phabricator.wikimedia.org/T284825) [13:04:33] (03Merged) 10jenkins-bot: ProductionServices: Promote pc1011 to primary of pc1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713845 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [13:04:36] (03Merged) 10jenkins-bot: ProductionServices: Promote pc1012 to primary of pc2. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713866 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [13:05:58] (03CR) 10Kormat: [C: 03+2] ProductionServices: Promote pc1013 to primary of pc3. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713867 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [13:07:15] (03Merged) 10jenkins-bot: ProductionServices: Promote pc1013 to primary of pc3. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/713867 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [13:09:06] !log kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote new h/w to primary of eqiad pc sections T284825 (duration: 01m 08s) [13:09:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:14] T284825: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 [13:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:41] 10SRE-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10Nahid) [13:17:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:36] 10SRE-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10NNair) This is approved for Chmielko. - Neha [13:17:59] 10SRE-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10chmielkomaslak) [13:18:44] 10SRE-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10Nahid) [13:18:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:00] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:17] !log reconfiguring replication tree on pc1 T284825 [13:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:26] T284825: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 [13:28:20] 10SRE-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10Nahid) [13:30:22] !log reconfiguring replication tree on pc2 T284825 [13:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:30] T284825: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 [13:32:11] 10SRE-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10NNair) This is approved for Kate. -Neha [13:32:43] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] maps: Allow creating ad-hoc python venvs for maps scripts [puppet] - 10https://gerrit.wikimedia.org/r/713674 (owner: 10Jgiannelos) [13:33:44] 10SRE-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10Nahid) [13:34:24] !log reconfiguring replication tree on pc3 T284825 [13:34:26] 10SRE-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10NNair) This is approved for Nathan. 
Thank you, Neha [13:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:44] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.14% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [13:35:32] (03PS2) 10Michael Große: Don't cache Query Builder index.html [puppet] - 10https://gerrit.wikimedia.org/r/713870 [13:36:03] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/713871 (https://phabricator.wikimedia.org/T289045) [13:37:15] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/713871 (https://phabricator.wikimedia.org/T289045) (owner: 10Kosta Harlan) [13:39:45] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/713871 (https://phabricator.wikimedia.org/T289045) (owner: 10Kosta Harlan) [13:40:14] 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10Nahid) [13:40:29] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[13:40:35] (03PS2) 10MSantos: maps: update imposm mapping [puppet] - 10https://gerrit.wikimedia.org/r/713827 (https://phabricator.wikimedia.org/T288400) [13:40:36] 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10Nahid) [13:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:04] 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10KLevan) [13:41:14] 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10Nahid) [13:42:05] 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10Nahid) [13:42:53] !log Start server-side upload for 1 video file (T289203) [13:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:05] T289203: Please upload a 729 MB video file to Wikimedia Commons - https://phabricator.wikimedia.org/T289203 [13:44:25] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [13:44:25] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . 
[13:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:53] !log Start server-side upload for 1 video file (T288628) [13:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:03] T288628: Server side upload for Jayanta (CIS-A2K) - https://phabricator.wikimedia.org/T288628 [13:47:38] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [13:47:38] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [13:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:41] !log Start server-side upload for 1 video file (T288554) [13:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:48] T288554: Server side upload for PantheraLeo1359531 - https://phabricator.wikimedia.org/T288554 [13:49:11] !log Start server-side upload for 1 video file (T288384) [13:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:18] T288384: Server side upload for Victorgrigas - https://phabricator.wikimedia.org/T288384 [13:52:22] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10NForrester) [13:58:59] !log zpapierski@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[13:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:30] (03PS3) 10Michael Große: Don't cache Query Builder index.html [puppet] - 10https://gerrit.wikimedia.org/r/713870 [14:01:42] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10jijiki) 05Open→03Resolved @Papaul, server works as it should, thank you very much for finding this! [14:03:42] (03PS1) 10Effie Mouzeli: hieradata: add new memcached servers to mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/713875 (https://phabricator.wikimedia.org/T278225) [14:06:43] (03CR) 10Effie Mouzeli: [C: 03+2] admin: add comment about tillerClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/712307 (owner: 10Effie Mouzeli) [14:09:14] (03Merged) 10jenkins-bot: admin: add comment about tillerClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/712307 (owner: 10Effie Mouzeli) [14:11:39] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10Papaul) You welcome [14:11:53] (03PS2) 10Effie Mouzeli: hieradata: add new memcached servers to mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/713875 (https://phabricator.wikimedia.org/T278225) [14:13:15] (03CR) 10Krinkle: [C: 03+1] hieradata: remove 9 redis shards [puppet] - 10https://gerrit.wikimedia.org/r/713842 (https://phabricator.wikimedia.org/T280582) (owner: 10Effie Mouzeli) [14:14:40] PROBLEM - Host maps2005 is DOWN: PING CRITICAL - Packet loss = 100% [14:16:24] (03PS1) 10Vgutierrez: envoyproxy: Allow configuring the admin address [puppet] - 10https://gerrit.wikimedia.org/r/713879 (https://phabricator.wikimedia.org/T271421) [14:16:40] RECOVERY - Host maps2005 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [14:17:42] (03CR) 10MSantos: [C: 03+1] maps: standardise the maps2.0 config in codfw, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [14:17:53] (03CR) 
10jerkins-bot: [V: 04-1] envoyproxy: Allow configuring the admin address [puppet] - 10https://gerrit.wikimedia.org/r/713879 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [14:22:44] RECOVERY - IPMI Sensor Status on maps2005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:24:39] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: remove 9 redis shards [puppet] - 10https://gerrit.wikimedia.org/r/713842 (https://phabricator.wikimedia.org/T280582) (owner: 10Effie Mouzeli) [14:24:54] PROBLEM - Logstash rate of ingestion percent change compared to yesterday #o11y on alert1001 is CRITICAL: 508.3 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [14:26:11] investigating ^ [14:26:26] !log disable puppet on mediawiki and memcached servers for 713842 [14:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:24] (03PS2) 10Vgutierrez: envoyproxy: Allow configuring the admin address [puppet] - 10https://gerrit.wikimedia.org/r/713879 (https://phabricator.wikimedia.org/T271421) [14:30:58] (03CR) 10Vgutierrez: "PCC looks as expected: https://puppet-compiler.wmflabs.org/compiler1001/30717/" [puppet] - 10https://gerrit.wikimedia.org/r/713879 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [14:31:39] (03PS3) 10Vgutierrez: envoyproxy: Allow configuring the admin address [puppet] - 10https://gerrit.wikimedia.org/r/713879 (https://phabricator.wikimedia.org/T271421) [14:31:44] PROBLEM - kartotherian endpoints health on maps2005 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [14:33:21] zpapierski: rdf-streaming-updater in kube stage is 
spamming logs [14:33:53] to the tune of >10k/s [14:36:19] ! enable puppet on mediawiki and memcached servers for 713842 [14:36:28] !log enable puppet on mediawiki and memcached servers for 713842 [14:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:45] zpapierski: looks like in debug mode a whole lot of binary-serialized data is being put into logs [14:38:19] @godog if zpapierski is not around, I can revert this [14:38:31] let me know [14:38:47] yeah, we should make it stop [14:38:57] effie: ack, thank you [14:39:14] I am tailing the problematic files on kubestage1001 and it seems paused/stopped at least for now [14:39:16] ok give me 5' [14:39:23] definitely better to revert [14:39:30] will do [14:41:19] (03PS1) 10Effie Mouzeli: flink-session-cluster: remove debug logging due to log spamming [deployment-charts] - 10https://gerrit.wikimedia.org/r/713882 [14:42:03] (03CR) 10Filippo Giunchedi: [C: 03+1] flink-session-cluster: remove debug logging due to log spamming [deployment-charts] - 10https://gerrit.wikimedia.org/r/713882 (owner: 10Effie Mouzeli) [14:44:18] (03CR) 10Effie Mouzeli: [C: 03+2] flink-session-cluster: remove debug logging due to log spamming [deployment-charts] - 10https://gerrit.wikimedia.org/r/713882 (owner: 10Effie Mouzeli) [14:44:42] 10SRE, 10ops-codfw, 10Maps: maps2005 power suply failure since a week - https://phabricator.wikimedia.org/T289113 (10Papaul) 05Open→03Resolved It was a loose power cord . Fixed [14:44:53] 10SRE, 10Cassandra, 10RESTBase-Cassandra, 10Patch-For-Review, and 2 others: Configure a threshold for earlier notification of /srv/cassandra/instance-data - https://phabricator.wikimedia.org/T191659 (10hnowlan) 05Open→03Resolved [14:45:26] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [14:45:49] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Allow configuring the admin address [puppet] - 10https://gerrit.wikimedia.org/r/713879 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [14:46:19] (03Abandoned) 10Vgutierrez: envoyproxy: Allow configuring the admin address [puppet] - 10https://gerrit.wikimedia.org/r/713879 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [14:47:22] 10SRE, 10ops-codfw, 10DC-Ops: codfw: Netbox Error - https://phabricator.wikimedia.org/T288586 (10Papaul) 05Open→03Resolved Have in a temporary cable ID for this. I will remove the serial cable once configuration is done. [14:47:36] (03Merged) 10jenkins-bot: flink-session-cluster: remove debug logging due to log spamming [deployment-charts] - 10https://gerrit.wikimedia.org/r/713882 (owner: 10Effie Mouzeli) [14:49:34] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [14:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:14] (03PS2) 10Legoktm: url_downloader: Don't cache ICMP database [puppet] - 10https://gerrit.wikimedia.org/r/713843 (https://phabricator.wikimedia.org/T286525) (owner: 10Alexandros Kosiaris) [14:51:15] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [14:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:50] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10RobH) @odimitrijevic, This is one of three current requests to add a new wmf employee to both ‘restricted’ and ‘analytics-priv... 
[14:58:54] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10RobH) @odimitrijevic, This is one of three current requests to add a new wmf employee to both ‘restricted’ and ‘analytics-privatedat... [14:59:03] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10RobH) @odimitrijevic, This is one of three current requests to add a new wmf employee to both ‘restricted’ and ‘analytics-priva... [14:59:05] (03CR) 10Herron: Add Varnish SLO dashboard (033 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: 10Ema) [14:59:53] additionally logstash json logs are filling the disk [15:00:18] https://phabricator.wikimedia.org/P17047 [15:00:25] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10RobH) Please note that the checklist is to be audited and checked off by SRE clinic duty, and not by third parties. Since someo... [15:00:33] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10RobH) Please note that the checklist is to be audited and checked off by SRE clinic duty, and not by third parties. Since someone el... [15:00:36] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10RobH) Please note that the checklist is to be audited and checked off by SRE clinic duty, and not by third parties. Since some... 
[15:01:12] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10RobH) [15:04:21] !log clean logstash json logs off logstash hosts [15:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:39] lots of old logstash-plain-* logs that can be removed as well [15:05:49] godog: ^ [15:06:00] (03PS2) 10Hnowlan: maps: move configuration overrides to main configuration [puppet] - 10https://gerrit.wikimedia.org/r/713451 (https://phabricator.wikimedia.org/T288810) [15:06:05] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on maps[1001-1004].eqiad.wmnet with reason: Awaiting decommissioning [15:06:09] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on maps[1001-1004].eqiad.wmnet with reason: Awaiting decommissioning [15:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:39] cwhite: ack thanks [15:08:10] RECOVERY - kartotherian endpoints health on maps2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [15:09:47] Did that revert make it to kubestage? There's still a lot of logs streaming in. [15:10:20] Another option is we can selectively drop these logs in the pipeline to protect the cluster. 
[15:10:27] I think that's the backlog on kafka draining, but yeah looks like the spam has ended [15:11:15] yeah I mean the json parsing failures on the logstash end are just noise at this point [15:11:34] PROBLEM - Check health of redis instance on 6379 on mc1025 is CRITICAL: CRITICAL: replication_delay is 646 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 356407 keys, up 114 days 22 hours - replication_delay is 646 https://wikitech.wikimedia.org/wiki/Redis [15:11:47] (03CR) 10Ahmon Dancy: Allow protocol for etcd server to be specified (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [15:13:48] cwhite: what do you think ? the lag on kafka for rsyslog-notice isn't decreasing yet [15:14:00] PROBLEM - kartotherian endpoints health on maps2005 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [15:15:40] godog: I think we should insert a drop. What do you think? 
[15:15:56] agreed, no point in keep doing the work [15:16:47] I'll send out a review [15:16:53] (03CR) 10Ahmon Dancy: Allow protocol for etcd server to be specified (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [15:16:56] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:18:16] PROBLEM - Check health of redis instance on 6379 on mc1031 is CRITICAL: CRITICAL: replication_delay is 658 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 391223 keys, up 174 days 2 hours - replication_delay is 658 https://wikitech.wikimedia.org/wiki/Redis [15:19:18] cwhite: mmhh I'm not sure what's the best way to drop the problematic messages, and only those [15:19:34] https://phabricator.wikimedia.org/P17047 from these logstash errors [15:19:48] RECOVERY - kartotherian endpoints health on maps2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [15:21:04] there isn't a great way to detect invalid json and drop it without a custom filter [15:22:01] godog: err, scratch that. I think there may be a way: https://www.elastic.co/guide/en/logstash/current/plugins-filters-json.html#plugins-filters-json-skip_on_invalid_json [15:22:57] worth a try! 
[15:23:02] better than the log spam [15:23:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:23:39] I have a meeting in 7 though [15:24:33] although, skipping won't drop it, it will just pass it on [15:24:43] * cwhite sighs [15:25:16] (03CR) 10Krinkle: [C: 04-1] Allow protocol for etcd server to be specified (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [15:25:23] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [15:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:44] fun [15:27:20] yeah not sure what's the best way to pinpoint and drop the problematic messages [15:29:36] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
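(Illustration of the exchange above, where skipping invalid JSON in the parser still passes the event on, so dropping it has to be a separate step. This is a minimal Python sketch of that "parse, and drop what does not parse" idea. It is not the Logstash change that was actually reviewed; the function name `drop_invalid_json` and the sample messages are hypothetical.)

```python
import json
from typing import Iterable, Iterator


def drop_invalid_json(messages: Iterable[str]) -> Iterator[dict]:
    """Yield parsed events and drop anything that is not a JSON object.

    Hypothetical helper: it stands in for a pipeline stage that discards
    unparseable messages instead of forwarding them unparsed.
    """
    for raw in messages:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            # Invalid JSON (for example truncated or binary-laden debug output): drop it.
            continue
        if isinstance(event, dict):
            yield event


# Made-up sample input: one well-formed event and two broken ones.
sample = [
    '{"level": "INFO", "message": "checkpoint completed"}',
    '{"level": "DEBUG", "message": "payload=\x00\x01',
    "not json at all",
]
kept = list(drop_invalid_json(sample))
```

(Only the first sample message survives. A filter like this discards every unparseable message regardless of its source, so in practice it would probably need to be scoped to the noisy producer.)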
[15:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:02] I have a meeting now but will resume later [15:30:42] (03PS1) 10Effie Mouzeli: rdf-streaming-updater: reduce task_manager replicas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/713891 [15:33:29] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2005.codfw.wmnet [15:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:49] (03CR) 10Effie Mouzeli: [C: 03+2] rdf-streaming-updater: reduce task_manager replicas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/713891 (owner: 10Effie Mouzeli) [15:35:30] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2005.codfw.wmnet [15:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:42] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:37:39] (03Merged) 10jenkins-bot: rdf-streaming-updater: reduce task_manager replicas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/713891 (owner: 10Effie Mouzeli) [15:38:06] godog: sorry for log spam, flink either logs almost nothing or too much altogether [15:38:14] I won't be doing that anymore on staging [15:38:35] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[15:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:32] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2005.codfw.wmnet [15:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:28] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2033 site=codfw tunnel=mc1033_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:42:45] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2005.codfw.wmnet [15:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:30] RECOVERY - Aggregate IPsec Tunnel Status codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:43:48] ^ τηοσε αρε με [15:43:52] grrr [15:44:01] ^ those alerts are mine [15:46:30] heh [15:47:44] (03PS1) 10Cwhite: logstash: temporarily mitigate rdf-streaming-updater spam [puppet] - 10https://gerrit.wikimedia.org/r/713892 [15:48:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:50:04] !log test2wiki)> delete from flaggedtemplates where ft_rev_id not in (select fp_stable from flaggedpages); (T289249) [15:50:10] (03CR) 10Cwhite: [C: 03+2] logstash: temporarily mitigate rdf-streaming-updater spam [puppet] - 10https://gerrit.wikimedia.org/r/713892 (owner: 10Cwhite) [15:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:12] T289249: flaggedtemplates table should not keep the whole history of all revisions - https://phabricator.wikimedia.org/T289249 [15:52:56] !log dpifke@deploy1002 Started deploy 
[performance/navtiming@f8bf39f]: Deploy CpuBenchmark processor again T281243 [15:53:02] !log dpifke@deploy1002 Finished deploy [performance/navtiming@f8bf39f]: Deploy CpuBenchmark processor again T281243 (duration: 00m 06s) [15:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:05] T281243: Expose CPU benchmark data to Prometheus/Grafana - https://phabricator.wikimedia.org/T281243 [15:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:20] Amir1: is that garbage data storage on by default [15:54:52] it's basically templatelinks but for the history of the whole wiki [15:55:16] that's why for ruwiki that table is bigger than ALL other tables combined [15:55:36] beautiful codebase [15:57:33] (03PS2) 10MMandere: varnish: Containerize varnish test environment [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) [16:00:05] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210819T1600). [16:00:48] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[16:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:46] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30719/console" [puppet] - 10https://gerrit.wikimedia.org/r/713451 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [16:10:36] !log remove rotated logstash-plain-* and logstash-json-* logs on logstash collectors [16:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:03] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Legoktm) a:03Legoktm >>! In T288848#7293721, @Legoktm wrote: > One other consideration is whether we need to specifically route index.php and api.php re... [16:12:39] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] maps: move configuration overrides to main configuration [puppet] - 10https://gerrit.wikimedia.org/r/713451 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [16:13:33] zpapierski: yeah looks like it, too much or too little, and (unconfirmed yet) some of the debug messages might be invalid json [16:13:44] cwhite: thank you [16:14:35] !log starting decommission of old eqiad maps hardware [16:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:55] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts maps1001.eqiad.wmnet [16:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:14] cwhite: looks like the backlog is draining, any action other than cleaning the logs that you are aware of ? 
[16:16:27] (03PS1) 10Cwhite: Revert "logstash: temporarily mitigate rdf-streaming-updater spam" [puppet] - 10https://gerrit.wikimedia.org/r/713853 [16:16:59] nevermind I missed the review [16:17:19] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "logstash: temporarily mitigate rdf-streaming-updater spam" [puppet] - 10https://gerrit.wikimedia.org/r/713853 (owner: 10Cwhite) [16:17:31] godog: I think that's it. I think we'll truncate logstash-json.log after the revert. [16:17:33] (03PS1) 10Hnowlan: tegola: remove config for decommissioned hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/713899 (https://phabricator.wikimedia.org/T288810) [16:18:24] cwhite: SGTM, thank you [16:24:15] (03CR) 10Cwhite: [C: 03+2] Revert "logstash: temporarily mitigate rdf-streaming-updater spam" [puppet] - 10https://gerrit.wikimedia.org/r/713853 (owner: 10Cwhite) [16:24:55] RECOVERY - Logstash rate of ingestion percent change compared to yesterday #o11y on alert1001 is OK: (C)210 ge (W)150 ge 82.22 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [16:28:04] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps1001.eqiad.wmnet [16:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:36] I've looked at https://phabricator.wikimedia.org/P17047 to understand where the json parsing was failing but no luck so far, I'll file a task [16:30:52] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts maps1002.eqiad.wmnet [16:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:28] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts maps1002.eqiad.wmnet [16:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:43] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts maps1002.eqiad.wmnet 
[16:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:29] (03PS1) 10Hnowlan: site: remove decommissioned maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/713903 (https://phabricator.wikimedia.org/T288810) [16:38:25] (03PS1) 10Jgiannelos: maps: Install kafkacat on maps masters [puppet] - 10https://gerrit.wikimedia.org/r/713904 [16:42:36] (03CR) 10Jgiannelos: "From docs: https://wikitech.wikimedia.org/wiki/Kafka" [puppet] - 10https://gerrit.wikimedia.org/r/713904 (owner: 10Jgiannelos) [16:43:30] (03PS2) 10Jgiannelos: maps: Install kafkacat on maps masters [puppet] - 10https://gerrit.wikimedia.org/r/713904 (https://phabricator.wikimedia.org/T270175) [16:44:43] (03PS3) 10Jgiannelos: maps: Install kafkacat on osm master nodes [puppet] - 10https://gerrit.wikimedia.org/r/713904 (https://phabricator.wikimedia.org/T270175) [16:45:07] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps1002.eqiad.wmnet [16:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:11] (03CR) 10jerkins-bot: [V: 04-1] maps: Install kafkacat on osm master nodes [puppet] - 10https://gerrit.wikimedia.org/r/713904 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [16:46:26] (03CR) 10Legoktm: [C: 03+2] Re-enable Score with Shellbox on most public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713726 (https://phabricator.wikimedia.org/T257066) (owner: 10Legoktm) [16:46:35] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts maps1003.eqiad.wmnet [16:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:15] (03Merged) 10jenkins-bot: Re-enable Score with Shellbox on most public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713726 (https://phabricator.wikimedia.org/T257066) (owner: 10Legoktm) [16:48:20] (03PS4) 10Jgiannelos: maps: Install kafkacat on osm master nodes [puppet] - 
10https://gerrit.wikimedia.org/r/713904 (https://phabricator.wikimedia.org/T270175) [16:49:12] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Re-enable Score with Shellbox on most public wikis (T257066) (duration: 01m 08s) [16:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:21] T257066: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 [16:49:56] 🎉 [16:51:58] I'm staring at https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&refresh=30s&from=now-30m&to=now and https://grafana.wikimedia.org/d/IjzWoqG7k/score?orgId=1&from=now-1h&to=now&refresh=30s - it's slowly going up [16:52:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:04] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [16:57:42] (03CR) 10Hnowlan: [C: 03+2] maps: Install kafkacat on osm master nodes [puppet] - 10https://gerrit.wikimedia.org/r/713904 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [17:00:03] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps1003.eqiad.wmnet [17:00:05] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210819T1700). 
[17:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:53] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts maps1004.eqiad.wmnet [17:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:08] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [17:03:24] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:05:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=redis_maps site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:06:43] (03PS1) 10Joal: Update dumps geoeditors readme for data unavailability [puppet] - 10https://gerrit.wikimedia.org/r/713905 [17:06:53] razzi: --^ please :) [17:11:10] (03PS3) 10Ahmon Dancy: Allow protocol for etcd server to be specified [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 [17:11:22] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps1004.eqiad.wmnet [17:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:24] (03CR) 10jerkins-bot: [V: 04-1] Allow protocol for etcd server to be specified [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [17:15:13] (03PS4) 10Ahmon Dancy: Allow protocol for etcd server to be specified [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 [17:15:15] (03PS1) 10Ahmon Dancy: Use array format to specify etcd server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713906 [17:17:32] (03CR) 10Ahmon Dancy: "Revised based on comments in patchset 2." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [17:18:12] (03CR) 10Ahmon Dancy: Allow protocol for etcd server to be specified (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [17:20:27] (03PS1) 10Ahmon Dancy: wmfSetupEtcd only supports array input [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713907 [17:20:55] (03CR) 10RLazarus: [V: 03+1 C: 03+2] envoyproxy: Add $runtime field to set a static runtime layer. [puppet] - 10https://gerrit.wikimedia.org/r/713504 (https://phabricator.wikimedia.org/T288815) (owner: 10RLazarus) [17:21:04] (03PS2) 10Ahmon Dancy: wmfSetupEtcd only supports array input [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713907 [17:22:05] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Quiddity) Should this be announced in the [[ https://meta.wikimedia.org/wiki/Tech/News/2021/34 | next Tech N... [17:22:39] (03CR) 10Ahmon Dancy: "Ready for re-review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [17:23:22] (03CR) 10Cwhite: [C: 03+2] aptrepo: add opensearch 1.x component [puppet] - 10https://gerrit.wikimedia.org/r/713701 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [17:23:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:27:21] (03PS5) 10Herron: retire role::kafka::monitoring and kafkamon[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/713307 (https://phabricator.wikimedia.org/T252773) [17:27:32] (03CR) 10Krinkle: [C: 03+1] Allow protocol for etcd server to be specified [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [17:27:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=redis_maps site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:28:53] (03CR) 10Krinkle: [C: 04-1] "LabsServices.php as well :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713906 (owner: 10Ahmon Dancy) [17:30:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:30:57] (03PS2) 10Ahmon Dancy: Use array format to specify etcd server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713906 [17:30:59] (03PS3) 10Ahmon Dancy: wmfSetupEtcd only supports array input [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713907 [17:31:10] (03CR) 10Ahmon Dancy: Use array format to specify etcd server (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713906 (owner: 10Ahmon Dancy) [17:32:43] (03PS1) 10Bstorm: openstack: stop paging for systemd alone [puppet] - 10https://gerrit.wikimedia.org/r/713909 [17:35:21] (03CR) 10Herron: [C: 03+2] retire role::kafka::monitoring and kafkamon[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/713307 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [17:35:43] (03CR) 10Bstorm: "I added this to the role to follow the usual puppet coding guidelines instead 
of the profile like we do because of the old multi-deploymen" [puppet] - 10https://gerrit.wikimedia.org/r/713909 (owner: 10Bstorm) [17:36:43] (03CR) 10Razzi: [C: 03+2] Update dumps geoeditors readme for data unavailability [puppet] - 10https://gerrit.wikimedia.org/r/713905 (owner: 10Joal) [17:41:35] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts kafkamon1001.eqiad.wmnet [17:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:15] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [17:43:23] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) As expected: {F34606863} [17:46:33] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [17:48:53] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafkamon1001.eqiad.wmnet [17:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:01] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: `kafkamon1001.eqiad.wm... 
[17:49:36] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts kafkamon2001.codfw.wmnet [17:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:52] (03PS1) 10Krinkle: tests: Improve testCrossDcCompatibility to catch mismatching types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713911 [17:58:30] (03CR) 10jerkins-bot: [V: 04-1] tests: Improve testCrossDcCompatibility to catch mismatching types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713911 (owner: 10Krinkle) [18:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window. Your patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210819T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:42] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafkamon2001.codfw.wmnet [18:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:52] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: `kafkamon2001.codfw.wm... [18:01:36] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10herron) [18:03:48] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10herron) 05Open→03Resolved Old hosts have been retired and the duplicate role cleaned up, resolving! 
[18:27:07] !log Beginning aqs deploy process [18:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:39] 10SRE, 10serviceops, 10Patch-For-Review: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10Krinkle) [18:54:48] (03PS2) 10Krinkle: tests: Improve testCrossDcCompatibility to catch mismatching types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713911 [18:58:19] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [19:00:04] brennen and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210819T1900). [19:00:35] here, logs look fairly quiet, rolling forward shortly. [19:03:00] (03CR) 10Ahmon Dancy: tests: Improve testCrossDcCompatibility to catch mismatching types (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713911 (owner: 10Krinkle) [19:03:46] !log razzi@deploy1002 Started deploy [analytics/aqs/deploy@57c253e]: Deploy aqs 9c062f2 [19:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:17] (03PS1) 10Brennen Bearnes: all wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713926 [19:04:19] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713926 (owner: 10Brennen Bearnes) [19:05:52] (03CR) 10Krinkle: tests: Improve testCrossDcCompatibility to catch mismatching types (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713911 (owner: 10Krinkle) [19:05:54] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713926 (owner: 10Brennen Bearnes) [19:07:16] !log razzi@deploy1002 Finished deploy 
[analytics/aqs/deploy@57c253e]: Deploy aqs 9c062f2 (duration: 03m 30s) [19:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:29] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.19 [19:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:50] (03PS5) 10Gehel: elastic: pull out execute_on_clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/706276 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [19:12:19] (03PS3) 10Krinkle: tests: Improve testCrossDcCompatibility to catch mismatching types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713911 [19:12:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:23] (03PS6) 10Gehel: elastic: pull out execute_on_clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/706276 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [19:13:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:44] (03CR) 10Krinkle: [C: 03+1] Use array format to specify etcd server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713906 (owner: 10Ahmon Dancy) [19:15:29] (03PS4) 10Krinkle: etcd: Only support array input in wmfSetupEtcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713907 (owner: 10Ahmon Dancy) [19:15:38] (03CR) 10Krinkle: [C: 03+1] etcd: Only support array input in wmfSetupEtcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713907 (owner: 10Ahmon Dancy) [19:15:52] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Legoktm) >>! In T257066#7295473, @Quiddity wrote: > Should this be announced in the [[ https://meta.wikimedi... [19:34:11] (03PS1) 10Ebernhardson: Add wcqs.svc.{codfw,eqiad}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/713929 (https://phabricator.wikimedia.org/T280001) [19:35:22] (03CR) 10jerkins-bot: [V: 04-1] Add wcqs.svc.{codfw,eqiad}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/713929 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [19:38:51] (03PS2) 10Ebernhardson: Add wcqs.svc.{codfw,eqiad}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/713929 (https://phabricator.wikimedia.org/T280001) [19:44:13] (03PS1) 10Gehel: Elasticsearch cookbooks: [cookbooks] - 10https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) [19:47:40] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/713559 (owner: 10Cwhite) [19:48:18] (03PS2) 10Gehel: Elasticsearch cookbooks: [cookbooks] - 10https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) [19:52:58] (03PS7) 10Gehel: elastic: pull out execute_on_clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/706276 
(https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [19:53:00] (03PS3) 10Gehel: Elasticsearch cookbooks: [cookbooks] - 10https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) [20:01:11] (03PS1) 10Ryan Kemper: link recco: use CNAME for analytics-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/713934 (https://phabricator.wikimedia.org/T285355) [20:11:06] (03PS1) 10RobH: new skus for shipment services and ram [software] - 10https://gerrit.wikimedia.org/r/713936 [20:11:57] (03CR) 10RobH: [C: 03+2] new skus for shipment services and ram [software] - 10https://gerrit.wikimedia.org/r/713936 (owner: 10RobH) [20:12:27] (03Merged) 10jenkins-bot: new skus for shipment services and ram [software] - 10https://gerrit.wikimedia.org/r/713936 (owner: 10RobH) [20:13:36] (03PS8) 10Ryan Kemper: elastic: pull out execute_on_clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/706276 (https://phabricator.wikimedia.org/T280221) [20:20:30] (03CR) 10Ryan Kemper: [C: 03+2] elastic: pull out execute_on_clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/706276 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [20:32:14] (03PS2) 10Ryan Kemper: link recco: use CNAME for analytics-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/713934 (https://phabricator.wikimedia.org/T285355) [20:32:37] (03PS1) 10Nskaggs: Fix webservice failing with error trying to raise exception [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713938 (https://phabricator.wikimedia.org/T289177) [20:34:04] (03PS3) 10Ryan Kemper: link recco: use CNAME for analytics-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/713934 (https://phabricator.wikimedia.org/T285355) [20:36:40] (03PS4) 10Ryan Kemper: Elasticsearch cookbooks: Represent ops as enum [cookbooks] - 10https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: 10Gehel) [20:37:26] (03PS4) 10Legoktm: linkrecommendation: use CNAME 
for analytics-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/713934 (https://phabricator.wikimedia.org/T285355) (owner: 10Ryan Kemper) [20:37:34] (03CR) 10Legoktm: [C: 03+1] linkrecommendation: use CNAME for analytics-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/713934 (https://phabricator.wikimedia.org/T285355) (owner: 10Ryan Kemper) [20:57:42] (03PS3) 10Ebernhardson: Add wcqs.svc.{codfw,eqiad}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/713929 (https://phabricator.wikimedia.org/T280001) [21:02:01] (03CR) 10Ahmon Dancy: [C: 03+1] tests: Improve testCrossDcCompatibility to catch mismatching types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713911 (owner: 10Krinkle) [21:09:50] (03CR) 10Ahmon Dancy: [C: 03+1] Use array format to specify etcd server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713906 (owner: 10Ahmon Dancy) [21:14:55] (03CR) 10Gehel: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: 10Gehel) [21:15:14] ^ feel free to merge ! [21:17:37] (03CR) 10Cwhite: [C: 03+2] openstack: add more fields to nova_fullstack_test logging [puppet] - 10https://gerrit.wikimedia.org/r/713559 (owner: 10Cwhite) [21:20:33] !log ladsgroup@mwmaint2002:~$ mwscript extensions/FlaggedRevs/maintenance/pruneRevData.php --wiki=huwiki --prune (T289249) [21:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:43] T289249: flaggedtemplates table should not keep the whole history of all revisions - https://phabricator.wikimedia.org/T289249 [21:27:30] (03CR) 10David Caro: [C: 03+1] "\@/ extra brownie points for managers coding!" 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713938 (https://phabricator.wikimedia.org/T289177) (owner: 10Nskaggs) [21:57:27] (03PS1) 10Ebernhardson: [WIP] blazegraph: Setup new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/713946 [22:01:17] (03PS2) 10Ebernhardson: [WIP] blazegraph: Setup new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/713946 [22:09:50] (03CR) 10David Caro: [C: 03+1] Fix webservice failing with error trying to raise exception (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713938 (https://phabricator.wikimedia.org/T289177) (owner: 10Nskaggs) [22:35:00] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:41:02] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:09] (03PS3) 10Ebernhardson: blazegraph: Setup new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/713946 [22:48:11] (03PS1) 10Ebernhardson: blazegraph: Setup tls termination for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/713958 (https://phabricator.wikimedia.org/T280001) [22:48:13] (03PS1) 10Ebernhardson: blazegraph: LVS for WCQS step 1 [puppet] - 10https://gerrit.wikimedia.org/r/713959 [23:00:04] brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for US Backport and Config training. Your patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210819T2300). 
[23:00:53] * thcipriani waves [23:01:29] o/ [23:15:40] !log ended backport & config window early, as no patches were scheduled and no new attendees for this week [23:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:48] (03CR) 10Jcrespo: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [23:17:29] (03CR) 10jerkins-bot: [V: 04-1] [WIP]: Add basic tox config [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [23:36:19] 10SRE, 10serviceops: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (10RLazarus) Thanks for the pointer! I think if we wanted to track metrics from each run, like request latency or number of passing assertions or something, pushgateway would be the tool for the job -- but I think we don't... [23:58:29] (03PS1) 10Tim Starling: Faster mailing list construction, exclusion list [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713854 [23:58:46] (03CR) 10Tim Starling: [C: 03+2] Faster mailing list construction, exclusion list [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713854 (owner: 10Tim Starling)