[00:01:14] (03CR) 10BryanDavis: wikitech::web: remove font packages from wikitech servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [00:09:35] JJMC89: Probably not enough info to be able to act on that... [00:10:51] (I can email my @wikimedia.org email from an external provider just fine) [00:11:40] Ok. I'll ask them to file a task if it is still an issue. [00:13:26] Need to know what addresses... What email they're sending from... Provider/IP [00:32:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) Please note the librenms alerts didnt clear for these, until they were powered down. I need to check the settings for their power redundancy to ensure t... [00:49:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:53:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:12:50] cjming: Drat, sorry, I had to run out. I'll reschedule it for the next window. [05:12:31] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) I think I narrowed it down. If I upload using plain CLI curl, it finishes instantaneously: {P17633}. Now when I use a stripped down version of SwiftFileB... 
[05:26:33] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [05:29:45] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:31:15] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 56 probes of 707 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:32:39] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.53 ms [05:37:21] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 6 probes of 707 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:40:05] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 155, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:15:03] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) If I use PHP's stream wrappers, literally: ` $opts = [ 'http' => [ 'method' => 'PUT', 'header' => $realHeaders, 'content' => $contents, ] ]; $ctx = st... [06:16:55] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) Also note that during the buster upgrade we did move from curl 7.52.1 to 7.64.0, so it could also be a regression from that. I haven't looked at the libcu... 
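(editor's note) The stream-wrapper snippet quoted in the 06:15:03 entry is cut off by the bot. The block below is a minimal sketch of that kind of upload, not the code from the task: the helper name `makePutContext`, the example header, and the payload are hypothetical. It builds only the stream context; the whole body is held in memory as a string.

```php
<?php
// Hypothetical helper (not from T275752): build a stream context for an HTTP PUT.
function makePutContext(array $headers, string $contents)
{
    $opts = [
        'http' => [
            'method'  => 'PUT',
            'header'  => $headers,   // list of "Name: value" strings
            'content' => $contents,  // entire request body, kept in memory
        ],
    ];
    return stream_context_create($opts);
}

// Example values are placeholders chosen for illustration.
$ctx = makePutContext(['Content-Type: application/octet-stream'], 'example payload');

// Sending the request would then look like this (assumed URL, not executed here):
// $response = file_get_contents($url, false, $ctx);
```

Because the body is a plain string, this approach is presumably only practical for payloads small enough to fit in memory.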
[06:40:34] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) I tried a few different variations on the PHP curl script, none of which made a difference: * Using `curl_setopt( $ch, CURLOPT_POSTFIELDS, $contents );`... [06:47:13] (03Abandoned) 10Filippo Giunchedi: graphite: bump fetch_timeout [puppet] - 10https://gerrit.wikimedia.org/r/734224 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [06:50:51] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: switch codfw TLS remote syslog destination to centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/734405 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211029T0700) [07:13:30] (03PS1) 10Elukey: profile::kafka::broker: add new Kafka PKI intermediate CA option [puppet] - 10https://gerrit.wikimedia.org/r/735565 (https://phabricator.wikimedia.org/T291905) [07:15:42] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31990/console" [puppet] - 10https://gerrit.wikimedia.org/r/735565 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [07:21:25] !log stop advertisement to NaWas - T288505 [07:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:49] (03PS2) 10Elukey: profile::kafka::broker: add new Kafka PKI intermediate CA option [puppet] - 10https://gerrit.wikimedia.org/r/735565 (https://phabricator.wikimedia.org/T291905) [07:21:51] (03PS1) 10Elukey: role::kafka::test::broker: use PKI Kafka TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/735566 (https://phabricator.wikimedia.org/T291905) [07:22:52] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31991/console" [puppet] - 10https://gerrit.wikimedia.org/r/735566 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [07:33:28] (03CR) 10Ema: [C: 03+2] varnishrls.mtail: various optimizations [puppet] - 10https://gerrit.wikimedia.org/r/735383 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [07:37:49] (03CR) 10Majavah: "also potentially related: https://phabricator.wikimedia.org/T294517" [puppet] - 10https://gerrit.wikimedia.org/r/735026 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [07:38:59] (03CR) 10Ayounsi: [C: 03+2] Add drmrs network to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/732351 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [07:41:33] (03PS3) 10Elukey: profile::kafka::broker: add new Kafka PKI intermediate CA option [puppet] - 10https://gerrit.wikimedia.org/r/735565 (https://phabricator.wikimedia.org/T291905) [07:41:35] (03PS2) 10Elukey: role::kafka::test::broker: use PKI Kafka TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/735566 (https://phabricator.wikimedia.org/T291905) [07:42:38] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31992/console" [puppet] - 10https://gerrit.wikimedia.org/r/735566 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [07:44:55] ACKNOWLEDGEMENT - Host mr1-drmrs.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2001:688:0:4::2d4) ayounsi ACK [07:45:01] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) >>! In T275752#7467352, @Legoktm wrote: > The `stream_context_create` solution only works for files that fit under the memory limit, which is currently ~6... 
[07:48:55] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:53:45] (03PS4) 10Elukey: profile::kafka::broker: add new Kafka PKI intermediate CA option [puppet] - 10https://gerrit.wikimedia.org/r/735565 (https://phabricator.wikimedia.org/T291905) [07:53:47] (03PS3) 10Elukey: role::kafka::test::broker: use PKI Kafka TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/735566 (https://phabricator.wikimedia.org/T291905) [07:54:47] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31993/console" [puppet] - 10https://gerrit.wikimedia.org/r/735566 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [07:55:48] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) setup/config PDU in drmrs ( ps1-b12 and ps1-b13) - https://phabricator.wikimedia.org/T294597 (10ayounsi) Added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/732351 https://icinga.wikimedia.org/cgi-bin/icinga/st... [07:56:47] (03PS5) 10Elukey: profile::kafka::broker: add new Kafka PKI intermediate CA option [puppet] - 10https://gerrit.wikimedia.org/r/735565 (https://phabricator.wikimedia.org/T291905) [07:56:49] (03PS4) 10Elukey: role::kafka::test::broker: use PKI Kafka TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/735566 (https://phabricator.wikimedia.org/T291905) [07:57:27] (03CR) 10Ayounsi: Add new PDU for drmrs site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735452 (https://phabricator.wikimedia.org/T294597) (owner: 10Papaul) [08:02:29] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10Volans) >>! In T290694#7467061, @RobH wrote: > Not sure why these are failing, but I'm out of mental bandwidth for them today. 
> > They are remotely accessibl... [08:05:00] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Joe) Just to clarify from a previous comment from @Legoktm, the "fread" reported in his output is produced by curl when it sends the data to the client. So I'm not... [08:11:06] (03PS2) 10David Caro: general: some dev-related improvements [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735319 (https://phabricator.wikimedia.org/T294624) [08:14:24] (03CR) 10David Caro: R:ceph::keyring: ensure all variables are defined (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735019 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [08:15:59] (03CR) 10David Caro: "Note that this requires running sudo for some commands when running the ci tests, as some permissions need changing for the nobody user to" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735319 (https://phabricator.wikimedia.org/T294624) (owner: 10David Caro) [08:31:08] (03PS1) 10Elukey: role::ml_k8s::master: add node-role.kubernetes.io/master labels [puppet] - 10https://gerrit.wikimedia.org/r/735577 (https://phabricator.wikimedia.org/T289834) [08:32:37] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31994/console" [puppet] - 10https://gerrit.wikimedia.org/r/735577 (https://phabricator.wikimedia.org/T289834) (owner: 10Elukey) [08:45:25] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Joe) So, we can go deeper in what's wrong here, but for now I just tested switching the URL we call in @Legoktm's `test.php` to funnel the request via envoy, and t... 
[08:49:51] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:50:24] <_joe_> !log depooling mw1305 while running tests [08:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:47] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Allow setting variables [puppet] - 10https://gerrit.wikimedia.org/r/735324 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [08:53:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/735565 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [08:56:23] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10ayounsi) @BBlack Looking at router config I found: ` /* Temporary for T239993 */ route 208.80.153.254/32 { next-hop 208.80.153.111; readvertise; no-resolve; } ` As 208.80.153.254 doesn't... [08:57:24] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Fix missing_xwd ACL [puppet] - 10https://gerrit.wikimedia.org/r/735325 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [08:58:41] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Bring systemd service unit up to date [puppet] - 10https://gerrit.wikimedia.org/r/735386 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:00:51] (03PS1) 10Ayounsi: Add mr1-drmrs to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/735582 (https://phabricator.wikimedia.org/T283050) [09:01:22] (03CR) 10jerkins-bot: [V: 04-1] Add mr1-drmrs to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/735582 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [09:01:59] (03PS2) 10Ayounsi: Add mr1-drmrs to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/735582 (https://phabricator.wikimedia.org/T283050) [09:02:37] (03CR) 10Ayounsi: [C: 03+2] Add mr1-drmrs to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/735582 
(https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [09:03:36] (03CR) 10Jbond: R:ceph::keyring: ensure all variables are defined (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735019 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [09:05:13] (03PS1) 10Elukey: role::kafka::main: add ServiceOps SREs as contact/owner [puppet] - 10https://gerrit.wikimedia.org/r/735583 [09:05:34] (03CR) 10David Caro: [C: 03+1] "That clarifies my questions, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/735019 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [09:07:22] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31996/console" [puppet] - 10https://gerrit.wikimedia.org/r/735583 (owner: 10Elukey) [09:07:32] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::kafka::main: add ServiceOps SREs as contact/owner [puppet] - 10https://gerrit.wikimedia.org/r/735583 (owner: 10Elukey) [09:07:44] (03CR) 10Jbond: [C: 03+2] "LGTM 😊" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735319 (https://phabricator.wikimedia.org/T294624) (owner: 10David Caro) [09:10:02] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Joe) So, what I think we know at this point is: * The problem is completely within php / curl; curl from the command line or pretty much any other client behaves c... [09:17:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] start_instance_with_prefix: fix issue when no previous instance exist [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731885 (owner: 10David Caro) [09:25:36] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! 
Nice work this should simplify things :)" [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [09:32:18] (03CR) 10Arturo Borrero Gonzalez: start_instance_prefix: add reusable params helpers (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731906 (owner: 10David Caro) [09:33:13] (03CR) 10Lucas Werkmeister (WMDE): "This should be good to go, but should be thoroughly tested to avoid breakage, and I don’t want to risk deploying it today. I’ll be on vaca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735367 (https://phabricator.wikimedia.org/T294224) (owner: 10Lucas Werkmeister (WMDE)) [09:35:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731907 (owner: 10David Caro) [09:39:11] (03CR) 10David Caro: wmcs: create composite type OpenstackIdentifier (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731907 (owner: 10David Caro) [09:43:00] (03CR) 10Lucas Werkmeister (WMDE): Add language codes agq and mcn to wmgExtraLanguageNames (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734717 (https://phabricator.wikimedia.org/T288335) (owner: 10Mbch331) [09:47:40] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Jelto) 05Open→03Resolved Great work, thank you @Arnoldokoth and @Dzahn . I updated the documentation in [GitLab/Backup_an... 
[09:48:58] (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [09:49:40] (03CR) 10David Caro: start_instance_prefix: add reusable params helpers (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731906 (owner: 10David Caro) [09:51:57] (03PS2) 10David Caro: start_instance_prefix: add reusable params helpers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731906 [09:54:37] (03CR) 10jerkins-bot: [V: 04-1] start_instance_prefix: add reusable params helpers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731906 (owner: 10David Caro) [09:54:51] (03CR) 10Arturo Borrero Gonzalez: start_instance_with_prefix: Group options in a dataclass (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731908 (owner: 10David Caro) [09:59:23] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10akosiaris) {F34715054} Adding a statistical analysis of packet lengths in wireshark from 2 captures. Upper one, with 3323 packets in total is standard curl call,... 
[10:04:21] (03CR) 10David Caro: start_instance_with_prefix: Group options in a dataclass (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731908 (owner: 10David Caro) [10:04:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "It took me a while to understand what was going on, but eventually got it :-P" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731910 (owner: 10David Caro) [10:08:14] (03CR) 10Arturo Borrero Gonzalez: start_instance_with_prefix: fix next instance counter (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731911 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [10:08:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] start_instance_with_prefix: add tries parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731912 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [10:09:35] (03CR) 10David Caro: start_instance_with_prefix: allow integer-suffixed prefixes (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731910 (owner: 10David Caro) [10:11:24] (03PS1) 10Jbond: O:cluster::management: only send out systemd emails in production [puppet] - 10https://gerrit.wikimedia.org/r/735585 [10:12:22] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Xover) I don't think curl sends `Expect: 100-continue` for chunked transfers to begin with, and I don't think chunks need to be ack'ed before sending the next in H... 
[10:13:21] (03CR) 10Jbond: [C: 03+2] "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/735585 (owner: 10Jbond) [10:14:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] start_instance_with_prefix: work around extra stderr message (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731913 (owner: 10David Caro) [10:16:37] (03PS1) 10Jbond: Profile::Base::remote_syslog_tls: expects a hash not an array [puppet] - 10https://gerrit.wikimedia.org/r/735587 [10:16:47] (03PS7) 10Vgutierrez: cache: Provide a HAproxy upload role [puppet] - 10https://gerrit.wikimedia.org/r/734209 (https://phabricator.wikimedia.org/T290005) [10:16:49] (03PS7) 10Vgutierrez: cache: Expose prometheus metrics for HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005) [10:16:51] (03PS1) 10Vgutierrez: haproxy: Enable TFO by default for the tls terminator [puppet] - 10https://gerrit.wikimedia.org/r/735588 (https://phabricator.wikimedia.org/T290005) [10:17:04] (03CR) 10Jbond: [C: 03+2] "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/735587 (owner: 10Jbond) [10:20:05] (03PS1) 10Jbond: cluster::managment: add default [puppet] - 10https://gerrit.wikimedia.org/r/735591 [10:20:26] (03CR) 10Jbond: [V: 03+2 C: 03+2] cluster::managment: add default [puppet] - 10https://gerrit.wikimedia.org/r/735591 (owner: 10Jbond) [10:20:36] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:20:36] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[10:20:39] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Enable TFO by default for the tls terminator [puppet] - 10https://gerrit.wikimedia.org/r/735588 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:25] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Enable TFO by default for the tls terminator [puppet] - 10https://gerrit.wikimedia.org/r/735588 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:24:51] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10akosiaris) > Can we force cmdline curl to HTTP/2, or PHP libcurl to HTTP/1.1, to test that? Right on target! Thanks for noticing that, that's the issue. HTTP2 sho... [10:27:19] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Joe) Yes, thanks for noticing @Xover; a posteriori, it's pretty obvious what is going on and it's interesting how much more inefficient using http2 is in this case... 
[10:29:06] (03CR) 10David Caro: start_instance_with_prefix: work around extra stderr message (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731913 (owner: 10David Caro) [10:30:35] (03PS1) 10David Caro: global: remove future parser support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735593 (https://phabricator.wikimedia.org/T294539) [10:33:05] (03PS1) 10Jbond: P:cluster: disable emails alerts in pontoon environments [puppet] - 10https://gerrit.wikimedia.org/r/735594 [10:33:34] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:cluster: disable emails alerts in pontoon environments [puppet] - 10https://gerrit.wikimedia.org/r/735594 (owner: 10Jbond) [10:37:25] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) >>! In T291905#7435799, @jbond wrote: >>>! In T291905#7435523, @jbond wrote: >>>>! In T291905#7431136,... [10:38:04] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. 
- elukey@cumin1001 [10:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:26] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735593 (https://phabricator.wikimedia.org/T294539) (owner: 10David Caro) [10:41:42] (03CR) 10Jbond: C:statistics::compute: correct user param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735029 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [10:45:31] (03CR) 10Jbond: [C: 03+1] ipmi: allow to hide parts of the command [software/spicerack] - 10https://gerrit.wikimedia.org/r/735421 (owner: 10Volans) [10:48:57] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Xover) [[ https://github.com/curl/curl/pull/2709 | #2709 ]] landed in libcurl 7.62.0 and enabled HTTP/2 multiplexing by default when available, so Buster would ind... [10:54:29] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Xover) >>! In T275752#7467725, @akosiaris wrote: > setting `curl_setopt( $ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);` in the code fixes it Hmm. Is HTTP/2 a... 
[10:54:40] (03PS1) 10David Caro: workspace: Improve docs and create missing directory [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735596 [10:55:06] (03PS1) 10Jbond: P:docker::storage: Add defaults for physical_volumes and vg_to_remove [puppet] - 10https://gerrit.wikimedia.org/r/735597 (https://phabricator.wikimedia.org/T294517) [10:55:23] (03CR) 10Jbond: [C: 03+2] R:ceph::keyring: ensure all variables are defined [puppet] - 10https://gerrit.wikimedia.org/r/735019 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [10:55:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31998/console" [puppet] - 10https://gerrit.wikimedia.org/r/735597 (https://phabricator.wikimedia.org/T294517) (owner: 10Jbond) [10:56:58] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Joe) >>! In T275752#7467817, @Xover wrote: >>>! In T275752#7467725, @akosiaris wrote: >> setting `curl_setopt( $ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);`... 
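(editor's note) The T275752 comments above report that pinning PHP's libcurl handle to HTTP/1.1 makes the slow upload go away. A minimal sketch of that setting follows; it is not the MediaWiki patch itself. The helper name `makeHttp11PutHandle` and the example URL are hypothetical. Only `CURLOPT_HTTP_VERSION` / `CURL_HTTP_VERSION_1_1` and `CURLOPT_POSTFIELDS` are taken from the quoted comments.

```php
<?php
// Hypothetical helper (not from the task): prepare a PUT upload that is
// forced onto HTTP/1.1 instead of letting libcurl negotiate HTTP/2.
function makeHttp11PutHandle(string $url, string $contents)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');      // send the body with the PUT verb
    curl_setopt($ch, CURLOPT_POSTFIELDS, $contents);     // request body as a string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the response instead of printing it
    // The setting reported in the task as fixing the timeout:
    curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    return $ch;
}

// Placeholder URL for illustration; nothing is sent until curl_exec() is called.
$ch = makeHttp11PutHandle('https://example.invalid/container/object', 'example payload');
// $response = curl_exec($ch);
```

To compare behaviour from the command line, the curl CLI has `--http1.1` and `--http2` switches that select the protocol version explicitly.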
[10:58:33] !log stashing on mwdebug1001 [10:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:26] (03PS1) 10Arturo Borrero Gonzalez: sslcert: introduce ca_deselect_dstx3 [puppet] - 10https://gerrit.wikimedia.org/r/735599 (https://phabricator.wikimedia.org/T292291) [10:59:28] (03PS1) 10Arturo Borrero Gonzalez: toolforge: exclude DST_Root_CA_X3 [puppet] - 10https://gerrit.wikimedia.org/r/735600 (https://phabricator.wikimedia.org/T292289) [11:00:24] !log [urbanecm@mwdebug1001 /srv/mediawiki]$ scap pull # livehacking ended [11:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sslcert: introduce ca_deselect_dstx3 [puppet] - 10https://gerrit.wikimedia.org/r/735599 (https://phabricator.wikimedia.org/T292291) (owner: 10Arturo Borrero Gonzalez) [11:02:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: exclude DST_Root_CA_X3 [puppet] - 10https://gerrit.wikimedia.org/r/735600 (https://phabricator.wikimedia.org/T292289) (owner: 10Arturo Borrero Gonzalez) [11:04:19] (03CR) 10Arturo Borrero Gonzalez: "heads up, I merged the class here:" [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [11:05:04] (03PS2) 10Jbond: P:docker::storage: Add defaults for physical_volumes and vg_to_remove [puppet] - 10https://gerrit.wikimedia.org/r/735597 (https://phabricator.wikimedia.org/T294517) [11:05:48] (03CR) 10Btullis: [C: 03+1] role::kafka::test::broker: use PKI Kafka TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/735566 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [11:05:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32002/console" [puppet] - 10https://gerrit.wikimedia.org/r/735597 (https://phabricator.wikimedia.org/T294517) (owner: 10Jbond) [11:05:53] (03CR) 10Jbond: [V: 03+1] "PCC 
SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32001/console" [puppet] - 10https://gerrit.wikimedia.org/r/735597 (https://phabricator.wikimedia.org/T294517) (owner: 10Jbond) [11:06:23] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:docker::storage: Add defaults for physical_volumes and vg_to_remove [puppet] - 10https://gerrit.wikimedia.org/r/735597 (https://phabricator.wikimedia.org/T294517) (owner: 10Jbond) [11:06:52] (03CR) 10Jbond: P:docker::engine: ensure we include all required classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735026 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [11:12:05] (03PS1) 10Jbond: P:docker::engine: don blindly include profile::docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/735603 (https://phabricator.wikimedia.org/T294517) [11:12:25] (03CR) 10Jbond: "and: https://gerrit.wikimedia.org/r/c/operations/puppet/+/735603" [puppet] - 10https://gerrit.wikimedia.org/r/735026 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [11:13:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32003/console" [puppet] - 10https://gerrit.wikimedia.org/r/735603 (https://phabricator.wikimedia.org/T294517) (owner: 10Jbond) [11:14:02] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32004/console" [puppet] - 10https://gerrit.wikimedia.org/r/735597 (https://phabricator.wikimedia.org/T294517) (owner: 10Jbond) [11:14:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:docker::engine: don blindly include profile::docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/735603 (https://phabricator.wikimedia.org/T294517) (owner: 10Jbond) [11:53:09] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:04:05] 
(03Abandoned) 10Arturo Borrero Gonzalez: cloud: ceph: refactor rbd client profile for cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/731933 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:05:50] (03PS1) 10MMandere: netboot: Add drmrs DC site ip routes [puppet] - 10https://gerrit.wikimedia.org/r/735608 (https://phabricator.wikimedia.org/T282787) [12:09:59] (03PS15) 10Jelto: services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) [12:18:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001 [12:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10cmooney) I'll try and sum up what my thought process on this was. Firstly the security consideration is that we will have cloudswi... 
[12:43:22] (03PS1) 10Arturo Borrero Gonzalez: ceph: introduce auth load abstraction [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) [12:43:57] (03CR) 10jerkins-bot: [V: 04-1] ceph: introduce auth load abstraction [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:49:45] (03CR) 10JMeybohm: [C: 03+1] services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [12:53:53] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:04:01] (03CR) 10Lucas Werkmeister (WMDE): Add language codes agq and mcn to wmgExtraLanguageNames (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734717 (https://phabricator.wikimedia.org/T288335) (owner: 10Mbch331) [13:05:30] (03CR) 10Ottomata: [C: 03+1] "Very cool!" [puppet] - 10https://gerrit.wikimedia.org/r/735565 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:08:06] (03CR) 10Ottomata: "Huh, weird. I'll look at this along with T291384 when I get to it." [puppet] - 10https://gerrit.wikimedia.org/r/735029 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [13:15:22] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Joe) While reading the code of MultiHttpClient, I found that is tries to support pipelining, which was removed from curl completely (and was already disabled in cu... [13:18:49] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Reedy) >>! 
In T275752#7468119, @Joe wrote: > While reading the code of MultiHttpClient, I found that is tries to support pipelining, which was removed from curl co... [13:28:18] (03CR) 10Ema: [C: 03+1] puppetboard: add puppetboard as an active/active service [puppet] - 10https://gerrit.wikimedia.org/r/734263 (owner: 10Jbond) [13:30:48] (03CR) 10Elukey: profile::kafka::broker: add new Kafka PKI intermediate CA option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735565 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:36:34] (03CR) 10Elukey: [C: 03+2] profile::kafka::broker: add new Kafka PKI intermediate CA option [puppet] - 10https://gerrit.wikimedia.org/r/735565 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:40:55] (03PS2) 10Mbch331: Add language codes agq and mcn to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734717 (https://phabricator.wikimedia.org/T288335) [13:42:01] (03CR) 10Mbch331: Add language codes agq and mcn to wmgExtraLanguageNames (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734717 (https://phabricator.wikimedia.org/T288335) (owner: 10Mbch331) [13:48:48] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::kafka::test::broker: use PKI Kafka TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/735566 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [14:00:22] (03PS1) 10Ottomata: Include base::puppet in profile::puppetmaster::pontoon [puppet] - 10https://gerrit.wikimedia.org/r/735650 [14:01:56] (03CR) 10Filippo Giunchedi: [C: 03+1] Include base::puppet in profile::puppetmaster::pontoon [puppet] - 10https://gerrit.wikimedia.org/r/735650 (owner: 10Ottomata) [14:12:17] (03PS1) 10Ottomata: pontoon - Don't use http proxy for apt [puppet] - 10https://gerrit.wikimedia.org/r/735654 [14:19:19] (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon - Don't use http proxy for apt [puppet] - 10https://gerrit.wikimedia.org/r/735654 (owner: 10Ottomata) [14:19:22] (03CR) 
10Lucas Werkmeister (WMDE): [C: 03+1] Add language codes agq and mcn to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734717 (https://phabricator.wikimedia.org/T288335) (owner: 10Mbch331) [14:24:10] (03PS1) 10Jbond: C:cert: create chained certificate using root [puppet] - 10https://gerrit.wikimedia.org/r/735656 [14:25:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32005/console" [puppet] - 10https://gerrit.wikimedia.org/r/735656 (owner: 10Jbond) [14:25:28] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:cert: create chained certificate using root [puppet] - 10https://gerrit.wikimedia.org/r/735656 (owner: 10Jbond) [14:45:55] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-test1008 is CRITICAL: 10 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1008 [14:51:06] (03PS1) 10Jbond: P:kafaka::broker: use chained file for p12 [puppet] - 10https://gerrit.wikimedia.org/r/735658 [14:52:00] (03CR) 10Elukey: [C: 03+1] P:kafaka::broker: use chained file for p12 [puppet] - 10https://gerrit.wikimedia.org/r/735658 (owner: 10Jbond) [14:53:35] (03CR) 10Elukey: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32006/console" [puppet] - 10https://gerrit.wikimedia.org/r/735658 (owner: 10Jbond) [14:54:13] (03CR) 10Elukey: [V: 03+1 C: 03+2] P:kafaka::broker: use chained file for p12 [puppet] - 10https://gerrit.wikimedia.org/r/735658 (owner: 10Jbond) [14:54:16] (03CR) 10Herron: [C: 03+1] hieradata/hosts: removing non-existent host files [puppet] - 10https://gerrit.wikimedia.org/r/735082 (owner: 10Dzahn) [14:58:39] (03Abandoned) 10BBlack: Switch all edge unified certs to digicert-2020 [puppet] - 
10https://gerrit.wikimedia.org/r/723510 (owner: 10BBlack) [14:59:19] (03Abandoned) 10BBlack: Temporarily block certain IABot reqs that are broken and spammy [puppet] - 10https://gerrit.wikimedia.org/r/647854 (owner: 10BBlack) [14:59:58] (03Abandoned) 10BBlack: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/570736 (owner: 10BBlack) [15:00:23] (03Abandoned) 10BBlack: Bytedance: further reqrate reduction [puppet] - 10https://gerrit.wikimedia.org/r/730789 (owner: 10BBlack) [15:01:45] (03CR) 10Elukey: [C: 03+1] "LGTM! (analytics* part)" [puppet] - 10https://gerrit.wikimedia.org/r/735082 (owner: 10Dzahn) [15:03:25] (03PS2) 10David Caro: ceph: introduce auth load abstraction [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [15:03:27] (03PS1) 10David Caro: p:ceph::osd: fix tests [puppet] - 10https://gerrit.wikimedia.org/r/735660 [15:03:49] (03PS3) 10David Caro: DONOTMERGE ceph: introduce auth load abstraction [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [15:04:01] (03PS3) 10BBlack: Add wikiworkshop.org to HSTS regex [puppet] - 10https://gerrit.wikimedia.org/r/723590 (https://phabricator.wikimedia.org/T251732) [15:04:45] (03CR) 10BBlack: [C: 03+2] Add wikiworkshop.org to HSTS regex [puppet] - 10https://gerrit.wikimedia.org/r/723590 (https://phabricator.wikimedia.org/T251732) (owner: 10BBlack) [15:04:57] (03CR) 10jerkins-bot: [V: 04-1] DONOTMERGE ceph: introduce auth load abstraction [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [15:06:51] (03CR) 10David Caro: [C: 03+2] global: remove future parser support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735593 (https://phabricator.wikimedia.org/T294539) (owner: 10David Caro) [15:09:50] (03PS1) 10Papaul: Add new PDU for drmrs site [puppet] - 
10https://gerrit.wikimedia.org/r/735661 (https://phabricator.wikimedia.org/T294597) [15:10:23] (03PS4) 10BBlack: sslcert::ca_deselect_dstx3 for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) [15:11:07] (03CR) 10Papaul: [C: 03+2] Add new PDU for drmrs site [puppet] - 10https://gerrit.wikimedia.org/r/735661 (https://phabricator.wikimedia.org/T294597) (owner: 10Papaul) [15:12:04] (03CR) 10jerkins-bot: [V: 04-1] sslcert::ca_deselect_dstx3 for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [15:13:27] (03Merged) 10jenkins-bot: global: remove future parser support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735593 (https://phabricator.wikimedia.org/T294539) (owner: 10David Caro) [15:17:11] 10SRE, 10CirrusSearch, 10Discovery-Search, 10Infrastructure-Foundations, and 6 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (10BBlack) I've rebased https://gerrit.wikimedia.org/r/c/operations/puppet/+/725331 ont... 
[15:17:27] (03PS3) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) [15:21:16] (03PS2) 10David Caro: start_instance_with_prefix: fix next index generation [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731885 [15:21:18] (03PS3) 10David Caro: start_instance_prefix: add reusable params helpers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731906 [15:21:20] (03PS2) 10David Caro: wmcs: create composite type OpenstackIdentifier [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731907 [15:21:22] (03PS3) 10David Caro: start_instance_with_prefix: Group options in a dataclass [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731908 [15:21:24] (03PS3) 10David Caro: start_instance_with_prefix: add tries parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731912 (https://phabricator.wikimedia.org/T292465) [15:21:26] (03PS3) 10David Caro: start_instance_with_prefix: work around extra stderr message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731913 [15:21:28] (03PS5) 10David Caro: toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) [15:22:12] (03Abandoned) 10David Caro: start_instance_with_prefix: allow integer-suffixed prefixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731910 (owner: 10David Caro) [15:22:19] (03Abandoned) 10David Caro: start_instance_with_prefix: fix next instance counter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731911 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [15:22:32] (03PS2) 10BBlack: discovery-map: update geoip for wmcs spaces [puppet] - 10https://gerrit.wikimedia.org/r/626655 [15:24:53] (03CR) 10jerkins-bot: [V: 04-1] toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 
(https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [15:27:34] (03PS1) 10Elukey: Revert "role::kafka::test::broker: use PKI Kafka TLS certificates" [puppet] - 10https://gerrit.wikimedia.org/r/735634 [15:28:24] (03PS1) 10Papaul: Add model sentry4 to both PDU's for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/735662 (https://phabricator.wikimedia.org/T294597) [15:28:55] (03CR) 10jerkins-bot: [V: 04-1] Add model sentry4 to both PDU's for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/735662 (https://phabricator.wikimedia.org/T294597) (owner: 10Papaul) [15:29:11] (03CR) 10Elukey: [C: 03+2] Revert "role::kafka::test::broker: use PKI Kafka TLS certificates" [puppet] - 10https://gerrit.wikimedia.org/r/735634 (owner: 10Elukey) [15:31:09] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-test1008 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1008 [15:31:29] this is me --^ [15:36:02] (03PS1) 10BBlack: discovery-map: add drmrs ranges [puppet] - 10https://gerrit.wikimedia.org/r/735665 (https://phabricator.wikimedia.org/T282787) [15:36:28] (03CR) 10BBlack: [C: 03+2] discovery-map: update geoip for wmcs spaces [puppet] - 10https://gerrit.wikimedia.org/r/626655 (owner: 10BBlack) [15:37:55] (03CR) 10BBlack: [C: 03+2] discovery-map: add drmrs ranges [puppet] - 10https://gerrit.wikimedia.org/r/735665 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [15:38:26] (03PS2) 10BBlack: discovery-map: add drmrs ranges [puppet] - 10https://gerrit.wikimedia.org/r/735665 (https://phabricator.wikimedia.org/T282787) [15:40:48] (03PS2) 10Papaul: Add model sentry4 to both PDU's for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/735662 (https://phabricator.wikimedia.org/T294597) [15:42:39] (03CR) 10Papaul: [C: 03+2] Add model sentry4 to both PDU's 
for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/735662 (https://phabricator.wikimedia.org/T294597) (owner: 10Papaul) [15:45:44] 10SRE, 10serviceops, 10Patch-For-Review, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Xover) "[[ https://blog.cloudflare.com/delivering-http-2-upload-speed-improvements/ | Delivering HTTP/2 upload speed improvements ]]" from Cl... [15:50:37] (03CR) 10Ahmon Dancy: [C: 04-1] "Not working right. Investigating." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732737 (owner: 10Ahmon Dancy) [15:57:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={dragonfly_dfdaemon,minio} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:57:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [15:58:04] we had an eqiad traffic-drop alert too, showing this odd data: https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cache_type=text [15:58:17] I assume thanos is related and traffic data is just wrong [15:59:00] (03Abandoned) 10AOkoth: gitlab: enable backup restore timer [puppet] - 10https://gerrit.wikimedia.org/r/735435 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [16:02:28] (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [16:04:35] (03PS1) 10Btullis: Add more alerts to the data-engineering team [alerts] - 10https://gerrit.wikimedia.org/r/735669 (https://phabricator.wikimedia.org/T293399) [16:06:29] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) A lot of progresses today with @jbond, here's a summary: * The new keystore contains the intermediate... [16:21:48] 10SRE, 10serviceops: rename OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Dzahn) a:05Dzahn→03Arnoldokoth [16:22:14] 10SRE, 10serviceops: rename OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Dzahn) 05Stalled→03Open p:05Triage→03Low [16:22:30] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=foundationwiki --userlist users.txt # T205347, users.txt is at P17639 [16:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:36] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [16:37:33] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=foundationwiki --userlist users.txt # T205347, users.txt is at P17640 [16:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:40] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [16:41:28] !log Connect Babel AutoCreate@foundationwiki to SUL (T205347) [16:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:27] (03PS1) 10Urbanecm: foundationwiki: Disable direct account creation [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/735674 (https://phabricator.wikimedia.org/T205347) [16:53:44] (03CR) 10Mbch331: Add missing termbox codes from Wikibase (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [17:42:57] !log [urbanecm@mwmaint1002 /srv/mediawiki/php/maintenance]$ mwscript reassignEdits.php --wiki=foundationwiki --norc 'Neil P. Quinn-WMF' 'Neil Shah-Quinn (WMF)' # part of SUL finalisation at foundationwiki, T205347 [17:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:05] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [17:47:34] !log Connect Neil Shah-Quinn (WMF)@foundationwiki to SUL (T205347) [17:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:59] (03PS1) 10Odder: Amend wordmark for the Meetei (Manipuri) Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735679 (https://phabricator.wikimedia.org/T294189) [18:26:24] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) >>! In T290694#7467448, @Volans wrote: >>>! In T290694#7467061, @RobH wrote: >> Not sure why these are failing, but I'm out of mental bandwidth for them... [18:37:32] (03PS2) 10Ahmon Dancy: php-fpm: Add settings to control debuggability [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732737 [18:42:20] (03CR) 10Dzahn: [C: 03+2] "thanks all! 
also all 404 here: https://puppet-compiler.wmflabs.org/compiler1002/32007/" [puppet] - 10https://gerrit.wikimedia.org/r/735082 (owner: 10Dzahn) [18:43:02] (03PS3) 10Ahmon Dancy: php-fpm: Add settings to control debuggability [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732737 [18:46:28] (03PS2) 10Ideophagous: updated arywiki namespaces as per T291737 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731290 [18:48:26] (03CR) 10Ahmon Dancy: [C: 03+1] "Rebased for 7.4, revised, and tested." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732737 (owner: 10Ahmon Dancy) [19:10:31] (03CR) 10Jdlrobson: [C: 03+1] "Looks good to me!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735679 (https://phabricator.wikimedia.org/T294189) (owner: 10Odder) [19:15:18] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp4033.ulsfo.wmnet with OS buster [19:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:25] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster [19:38:46] (03CR) 10Dzahn: "*nod* thank you! puppetboard is happy" [puppet] - 10https://gerrit.wikimedia.org/r/735026 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [19:41:30] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) 05Open→03In progress UEFI boot mode was enabled, which is why it was failing rather than attempting to actually hit our PXE server. Changed to bios... 
[19:44:24] 10SRE, 10serviceops: rename OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Dzahn) discussed in 1:1 with Arnold [19:46:38] (03PS1) 10Dzahn: mediawiki: remove font packages from all canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/735685 (https://phabricator.wikimedia.org/T294378) [19:48:15] (03CR) 10Dzahn: "Wasn't there also "labwebtest" or so at some point where we could potentially test it before prod Wikitech wiki?" [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [19:49:03] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4033.ulsfo.wmnet with OS buster [19:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:10] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster executed with errors: - cp4033 (*... [19:50:19] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) hung on loading ramdisk post install reboot... 
not sure why [19:52:18] (03PS1) 10RobH: ferm not firm [puppet] - 10https://gerrit.wikimedia.org/r/735687 (https://phabricator.wikimedia.org/T290694) [19:52:53] (03CR) 10RobH: [C: 03+2] ferm not firm [puppet] - 10https://gerrit.wikimedia.org/r/735687 (https://phabricator.wikimedia.org/T290694) (owner: 10RobH) [19:56:37] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp4033.ulsfo.wmnet with OS buster [19:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:46] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster [20:00:09] (03PS1) 10Ottomata: [WIP] profile::analytics::database::mariadb_multi [puppet] - 10https://gerrit.wikimedia.org/r/735688 (https://phabricator.wikimedia.org/T284150) [20:00:40] (03CR) 10jerkins-bot: [V: 04-1] [WIP] profile::analytics::database::mariadb_multi [puppet] - 10https://gerrit.wikimedia.org/r/735688 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [20:01:42] (03PS2) 10Ottomata: [WIP] profile::analytics::database::mariadb_multi [puppet] - 10https://gerrit.wikimedia.org/r/735688 (https://phabricator.wikimedia.org/T284150) [20:02:13] (03CR) 10jerkins-bot: [V: 04-1] [WIP] profile::analytics::database::mariadb_multi [puppet] - 10https://gerrit.wikimedia.org/r/735688 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [20:02:27] (03CR) 10Ottomata: [WIP] profile::analytics::database::mariadb_multi (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735688 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [20:02:32] (03PS3) 10Ottomata: [WIP] profile::analytics::database::mariadb_multi [puppet] - 10https://gerrit.wikimedia.org/r/735688 (https://phabricator.wikimedia.org/T284150) [20:12:07] (03CR) 10Urbanecm: 
wikitech::web: remove font packages from wikitech servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [20:14:46] (03CR) 10Urbanecm: [C: 04-1] "Thanks for the patch! Left some notes inline. The most important one is indentation (soon, jenkins will probably complain about it too) an" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731290 (owner: 10Ideophagous) [20:14:49] (03CR) 10Urbanecm: [C: 04-1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731290 (owner: 10Ideophagous) [20:15:04] (03CR) 10Tacsipacsi: wikireplicas: add Translate extension tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735088 (https://phabricator.wikimedia.org/T289952) (owner: 10AntiCompositeNumber) [20:15:35] (03CR) 10jerkins-bot: [V: 04-1] updated arywiki namespaces as per T291737 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731290 (owner: 10Ideophagous) [20:16:55] (03PS1) 10Dzahn: mariadb/icinga: page people if sanitarium master goes down [puppet] - 10https://gerrit.wikimedia.org/r/735689 (https://phabricator.wikimedia.org/T233684) [20:22:52] (03PS1) 10RobH: insetup_noferm [puppet] - 10https://gerrit.wikimedia.org/r/735692 (https://phabricator.wikimedia.org/T290694) [20:23:26] (03CR) 10RobH: [C: 03+2] insetup_noferm [puppet] - 10https://gerrit.wikimedia.org/r/735692 (https://phabricator.wikimedia.org/T290694) (owner: 10RobH) [20:32:02] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/32016/" [puppet] - 10https://gerrit.wikimedia.org/r/735689 (https://phabricator.wikimedia.org/T233684) (owner: 10Dzahn) [20:39:12] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4033.ulsfo.wmnet with OS buster [20:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:20] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install 
cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster completed:... [20:42:27] (03PS1) 10Dzahn: icinga: use display_name for a HOST to add '#page' string where applicable [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) [20:42:49] mutante: maybe take the #_page out [20:43:01] many of us have it as a highlight on IRC :p [20:43:06] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster [20:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:16] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4034.ulsfo.wmnet with OS buster [20:43:28] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp4035.ulsfo.wmnet with OS buster [20:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:36] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4035.ulsfo.wmnet with OS buster [20:43:48] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp4036.ulsfo.wmnet with OS buster [20:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:56] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host 
cp4036.ulsfo.wmnet with OS buster [20:44:17] (03CR) 10jerkins-bot: [V: 04-1] icinga: use display_name for a HOST to add '#page' string where applicable [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [20:44:18] Spookreeeno: legoktm: oh, lol, did not think about that, fixing [20:44:28] Np [20:44:37] (03PS2) 10Dzahn: icinga: use display_name for a HOST to add 'page' string where applicable [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) [20:45:27] (03CR) 10Dzahn: "oh.. so while upstream Nagios host object has display_name our issues is: Monitoring::Exported_nagios_host[labweb1001]: has no parameter n" [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [20:46:27] (03CR) 10jerkins-bot: [V: 04-1] icinga: use display_name for a HOST to add 'page' string where applicable [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [20:47:36] (03CR) 10Dzahn: [C: 04-2] "we need to add this to Monitoring::Exported_nagios_host first" [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [20:54:30] (03PS1) 10Legoktm: MultiHttpClient: Allow setting HTTP protocol version in curl [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735637 (https://phabricator.wikimedia.org/T275752) [20:54:36] (03CR) 10Legoktm: [C: 03+2] MultiHttpClient: Allow setting HTTP protocol version in curl [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735637 (https://phabricator.wikimedia.org/T275752) (owner: 10Legoktm) [20:54:46] (03PS1) 10Dzahn: exported_nagios_host: add the display_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/735696 (https://phabricator.wikimedia.org/T236379) [20:55:13] (03PS1) 10Legoktm: Force using HTTP 1.1 for SwiftFileBackend [core] (wmf/1.38.0-wmf.6) - 
10https://gerrit.wikimedia.org/r/735638 (https://phabricator.wikimedia.org/T275752) [20:55:21] (03CR) 10Legoktm: [C: 03+2] Force using HTTP 1.1 for SwiftFileBackend [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735638 (https://phabricator.wikimedia.org/T275752) (owner: 10Legoktm) [20:57:23] (03CR) 10jerkins-bot: [V: 04-1] exported_nagios_host: add the display_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/735696 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [20:59:00] (03Abandoned) 10Dzahn: exported_nagios_host: add the display_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/735696 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [20:59:09] (03PS3) 10Dzahn: icinga: use display_name for a HOST to add 'page' string where applicable [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) [20:59:39] (03CR) 10Dzahn: "see https://gerrit.wikimedia.org/r/c/operations/puppet/+/735695" [puppet] - 10https://gerrit.wikimedia.org/r/735696 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [21:01:36] (03CR) 10jerkins-bot: [V: 04-1] icinga: use display_name for a HOST to add 'page' string where applicable [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [21:03:19] (03PS4) 10Dzahn: icinga: use display_name for a HOST to add 'page' string where applicable [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) [21:06:03] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/32021/" [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [21:06:27] (03CR) 10Dzahn: [V: 03+1] "You can see in the compiler output how it adds the string for labweb* but not the other host." 
[puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [21:06:29] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4036.ulsfo.wmnet with OS buster [21:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:37] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4036.ulsfo.wmnet with OS buster executed wi... [21:07:06] 10SRE, 10serviceops, 10MW-1.38-notes (1.38.0-wmf.7; 2021-11-02), 10Patch-For-Review, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) >>! In T275752#7467700, @Xover wrote: > But then, as Lego's output above shows, cmdlin... [21:14:50] (03Merged) 10jenkins-bot: MultiHttpClient: Allow setting HTTP protocol version in curl [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735637 (https://phabricator.wikimedia.org/T275752) (owner: 10Legoktm) [21:14:56] (03Merged) 10jenkins-bot: Force using HTTP 1.1 for SwiftFileBackend [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735638 (https://phabricator.wikimedia.org/T275752) (owner: 10Legoktm) [21:16:00] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4035.ulsfo.wmnet with OS buster [21:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:06] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4035.ulsfo.wmnet with OS buster completed: - cp4035 (**PASS**)... 
[21:17:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:35] !log legoktm@deploy1002 Synchronized php-1.38.0-wmf.6/includes/libs/http/MultiHttpClient.php: MultiHttpClient: Allow setting HTTP protocol version in curl (T275752) (duration: 00m 57s) [21:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:41] T275752: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 [21:17:52] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4034.ulsfo.wmnet with OS buster [21:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:58] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4034.ulsfo.wmnet with OS buster completed: - cp4034 (**PASS**)... [21:20:00] !log legoktm@deploy1002 Synchronized php-1.38.0-wmf.6/includes/libs/filebackend/SwiftFileBackend.php: Force using HTTP 1.1 for SwiftFileBackend (T275752) (duration: 00m 55s) [21:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:27] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [21:29:07] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [21:29:42] 10SRE, 10serviceops, 10MW-1.38-notes (1.38.0-wmf.7; 2021-11-02), 10Patch-For-Review, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) ` 2021-10-29 21:21:39 [296d3ab0-158d-47d4-b10b-25ae741838c7] mw1308 testwiki 1.38.0-wm... [21:32:49] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp4036.ulsfo.wmnet with OS buster [21:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:56] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4036.ulsfo.wmnet with OS buster [21:33:40] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) [21:38:37] (03CR) 10Urbanecm: [C: 04-2] "setting a -2 temporarily, until Lydia checks and approves (see task)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735627 (https://phabricator.wikimedia.org/T294632) (owner: 10Juan90264) [21:39:50] (03CR) 10Urbanecm: [C: 03+1] "Code looks good, SVG looks optimized => +1. Needs a backport scheduled though." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/735679 (https://phabricator.wikimedia.org/T294189) (owner: 10Odder) [21:46:35] (03PS1) 10Dzahn: wikistats: refactor related to supporting bullseye with new PHP version [puppet] - 10https://gerrit.wikimedia.org/r/735698 [21:47:08] (03CR) 10jerkins-bot: [V: 04-1] wikistats: refactor related to supporting bullseye with new PHP version [puppet] - 10https://gerrit.wikimedia.org/r/735698 (owner: 10Dzahn) [21:51:15] (03PS2) 10Dzahn: wikistats: refactor related to supporting bullseye with new PHP version [puppet] - 10https://gerrit.wikimedia.org/r/735698 [21:51:50] (03CR) 10jerkins-bot: [V: 04-1] wikistats: refactor related to supporting bullseye with new PHP version [puppet] - 10https://gerrit.wikimedia.org/r/735698 (owner: 10Dzahn) [21:56:41] (03CR) 10BryanDavis: [C: 03+1] "Adding some WMCS SRE folks as reviewers so we can get this merged and shipped." [puppet] - 10https://gerrit.wikimedia.org/r/735088 (https://phabricator.wikimedia.org/T289952) (owner: 10AntiCompositeNumber) [21:57:22] (03PS3) 10Dzahn: wikistats: refactor related to supporting bullseye with new PHP version [puppet] - 10https://gerrit.wikimedia.org/r/735698 [21:57:55] (03CR) 10jerkins-bot: [V: 04-1] wikistats: refactor related to supporting bullseye with new PHP version [puppet] - 10https://gerrit.wikimedia.org/r/735698 (owner: 10Dzahn) [22:02:10] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4036.ulsfo.wmnet with OS buster [22:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:16] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4036.ulsfo.wmnet with OS buster completed: - cp4036 (**WARN**)... 
[22:03:46] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) 05In progress→03Resolved [22:05:46] (03PS4) 10Dzahn: wikistats: refactor related to supporting bullseye with new PHP version [puppet] - 10https://gerrit.wikimedia.org/r/735698 [22:07:25] 10SRE, 10serviceops, 10MW-1.38-notes (1.38.0-wmf.6; 2021-10-26), 10Patch-For-Review, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) I watched the uploads for {T292769} and they took 7s and 9s for 714MB and 864 MB files... [22:17:25] 10SRE, 10serviceops, 10MW-1.38-notes (1.38.0-wmf.6; 2021-10-26), 10Patch-For-Review, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) [22:39:28] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:50:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10bd808) [22:57:45] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=foundationwiki --userlist users.txt # T205347, users.txt is at P17641 [22:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:51] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [23:05:46] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1137.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:11:50] (03CR) 10Cwhite: [C: 03+1] rsyslog: switch codfw TLS remote syslog destination to centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/734405 
(https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [23:14:45] (03CR) 10Dzahn: [C: 03+2] wikistats: refactor related to supporting bullseye with new PHP version [puppet] - 10https://gerrit.wikimedia.org/r/735698 (owner: 10Dzahn) [23:17:11] (03CR) 10Dzahn: "noop on buster, fixed some errors (but not all) on bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/735698 (owner: 10Dzahn) [23:27:47] (03PS1) 10Dzahn: wikistats: use wmflib::dir::mkdir_p, ensure dirs exist even before package [puppet] - 10https://gerrit.wikimedia.org/r/735701 [23:28:19] (03CR) 10jerkins-bot: [V: 04-1] wikistats: use wmflib::dir::mkdir_p, ensure dirs exist even before package [puppet] - 10https://gerrit.wikimedia.org/r/735701 (owner: 10Dzahn) [23:29:33] (03PS2) 10Dzahn: wikistats: use wmflib::dir::mkdir_p, ensure dirs exist before first deploy [puppet] - 10https://gerrit.wikimedia.org/r/735701 [23:30:48] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:31:00] (03CR) 10Juan90264: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735679 (https://phabricator.wikimedia.org/T294189) (owner: 10Odder) [23:31:40] (03CR) 10Dzahn: [C: 03+2] wikistats: use wmflib::dir::mkdir_p, ensure dirs exist before first deploy [puppet] - 10https://gerrit.wikimedia.org/r/735701 (owner: 10Dzahn) [23:34:07] (03CR) 10Dzahn: "puppet run on fresh bullseye instance now fixed, no more alert mails" [puppet] - 10https://gerrit.wikimedia.org/r/735701 (owner: 10Dzahn) [23:34:46] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:40:16] RECOVERY - SSH on 
kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:45:37] 10SRE, 10ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (10Papaul) REplacing PEM1 didn't clear the alarm, PEM1 is not the issue. I unplugged PEM1 and plugged it into the same PDU with PEM0 it clears the alarm so i am thinking that the PDU where PEM1... [23:50:11] (03PS4) 10Dzahn: add miscweb to LVS [puppet] - 10https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538) [23:54:38] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:56:18] 10SRE, 10ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (10Papaul) The 2 PDU's are not tracked in Netbox. I found the setup task of eqord and eqdfw where the PDU's are listed. Please see below https://phabricator.wikimedia.org/T91077 [23:58:38] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica