Offline KubeSphere Installation: A Hands-On Record (Failed Install, Assorted Problems)


This content is licensed under the CC 4.0 BY-SA license.


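This hands-on record follows the tutorial from the official website and documents what happened in my actual server environment.

Tutorial for offline installation of 4.1: Offline Installation of KubeSphere

Prepare at least three hosts, as in the example below.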

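Host IP	Hostname	Role
172.16.20.21	node1	Internet-connected host, used to build the offline package
172.16.21.30	master	Control-plane node in the offline environment
172.16.20.20	node2	Image registry node in the offline environment (can be skipped if you already have a registry). In my case this server is the Docker image registry I deployed with Nexus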


socat and conntrack must be installed on master and node2.

My servers run CentOS 7.9 and Rocky Linux release 8.10 (Green Obsidian), so I installed them with yum instead

yum install socat conntrack -y
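
After installing, it can be worth confirming that the tools are actually on the PATH before moving on. The following is a minimal sketch, not part of the official tutorial; the helper name missing_tools is made up for illustration.

```shell
# Print every command from the argument list that is not found on PATH.
# (missing_tools is a hypothetical helper name, not a kk feature.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the two tools the tutorial requires on master and node2.
missing=$(missing_tools socat conntrack)
if [ -n "$missing" ]; then
  echo "missing: $missing"
else
  echo "all required tools found"
fi
```

The same function can be reused later for other tools that the installer complains about, by passing their names as extra arguments.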

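Getting the version information and image list

Visit KubeSphere Images
Select the extensions to deploy. I selected all of them
Enter your email address.
Click to get the image list.
Check the mailbox you entered for the latest KubeSphere version information and the image list files.

After requesting, check the email and download the attachments

File name	Description
`kubesphere-images.txt`	Lists all images used by KubeSphere and its extensions, together with their addresses on Huawei Cloud; the images can be synced to the offline registry from this list.
`kk-manifest.yaml`	Contains all images used by KubeSphere and its extensions; kk can use it to build the offline package quickly.
`kk-manifest-mirror.yaml`	Contains all images of KubeSphere and its extensions hosted in the Huawei Cloud image registry. Use this manifest to build the offline package when access to DockerHub is restricted.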

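Building the offline installation package

Log in to node1, which has internet access, and follow the steps below to build the KubeSphere offline installation package.

1. Install KubeKey

Run the following command to install the KubeKey tool.

Once the download completes, the KubeKey binary kk appears in the current directory.

ssh 172.16.20.21

Create a dedicated directory first: mkdir /home/kubesphere

Enter the directory: cd /home/kubesphere

Run the following command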

curl -sSL https://get-kk.kubesphere.io | sh -

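After several attempts, kk finally downloaded successfully

2. Create the manifest file

The description below is hard to follow: it is unclear what "only use kk to package the KubeSphere images for the offline environment" means, or what "use kk to deploy Kubernetes and the image registry" means. My goal is simply an offline deployment, so I set that aside for now, followed the steps, and will summarize later

Run the command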

export KKZONE=cn


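# To deploy the image registry offline with kk, add --with-registry to package the registry installation files
./kk create manifest --with-kubernetes v1.26.12 --with-registry

# I want to use the Docker registry I deployed with Nexus, so I leave out --with-registry

The command I actually ran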

./kk create manifest --with-kubernetes v1.26.12

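This command creates a manifest-sample.yaml file.

3. Edit the manifest file

If you need kk to deploy Kubernetes and the image registry, simply add the KubeSphere image list from the email to the newly created manifest file.

That sentence is puzzling: what does "if you need kk to deploy Kubernetes and the image registry" mean? Of course I need that

The tutorial does not explain what to do when the image registry is an existing local one, so I simply continued with the tutorial

Open the manifest file.

vi manifest-sample.yaml

Copy the image list from kk-manifest.yaml, or from kk-manifest-mirror.yaml if access to DockerHub is restricted, into the newly created manifest-sample.yaml file.

In other words, which file to copy from depends on your network. Being in mainland China, I chose kk-manifest-mirror.yaml. Note the bold red text above: copy only the image list and nothing else, and paste it under spec.images in manifest-sample.yaml, after the images already listed there. Why kk-manifest-mirror.yaml? Judging by the names, one is the original and the other a mirror, and the addresses in kk-manifest-mirror.yaml point to the mirror registry swr.cn-southwest-2.myhuaweicloud.com

For the modified content, see the attached resource: manifest-sample.yaml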
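
Copying only the image list by hand is easy to get wrong, so here is a rough sketch of pulling the list items out with awk. It assumes the manifest has an `images:` key followed by `- ...` list items, as a kk manifest normally does; the function name extract_images and the two image names in the sample are made up for illustration.

```shell
# Print only the list items that follow an "images:" key.
# Reads the named file, or standard input when no file is given.
extract_images() {
  awk '
    /^[ \t]*images:[ \t]*$/    { in_list = 1; next }  # start of the image list
    in_list && /^[ \t]*-[ \t]/ { print; next }        # a list item: keep it
    in_list && /^[ \t]*$/      { next }               # tolerate blank lines
    { in_list = 0 }                                   # any other line ends the list
  ' "$@"
}

# Demo on a tiny, made-up manifest fragment.
extract_images <<'EOF'
spec:
  arches:
  - amd64
  images:
  - swr.cn-southwest-2.myhuaweicloud.com/ks/example/image-a:v1
  - swr.cn-southwest-2.myhuaweicloud.com/ks/example/image-b:v2
  registry:
    auths: {}
EOF
```

Running `extract_images kk-manifest-mirror.yaml > images.txt` would then give a block that can be pasted under spec.images in manifest-sample.yaml. Check the indentation against the target file before pasting.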

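4. Build the offline package

Run the following command to build an offline installation package containing the images for ks-core and the extensions.

./kk artifact export -m manifest-sample.yaml -o kubesphere.tar.gz

I recommend running it in the background, because it takes a long time and the session can easily drop

Set the variable first: export KKZONE=cn

nohup ./kk artifact export -m manifest-sample.yaml -o kubesphere.tar.gz > k.log 2>&1 &

When it fails, the cause is mostly a timeout downloading from GitHub; rerunning it manually a few times eventually works

On success, the following output is shown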

images/index.json
images/oci-layout
kube/v1.26.12/amd64/kubeadm
kube/v1.26.12/amd64/kubectl
kube/v1.26.12/amd64/kubelet
runc/v1.1.12/amd64/runc.amd64
09:55:41 CST success: [LocalHost]
09:55:41 CST [ChownOutputModule] Chown output file
09:55:41 CST success: [LocalHost]
09:55:41 CST [ChownWorkerModule] Chown ./kubekey dir
09:55:41 CST success: [LocalHost]
09:55:41 CST Pipeline[ArtifactExportPipeline] execute successfully
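
When the export runs in the background with nohup, the result has to be read from the log file. A small sketch for checking it, assuming the log contains the same success line as the output above (export_succeeded is a hypothetical helper name):

```shell
# Return 0 if the given log file contains the success line printed by
# `kk artifact export`, and non-zero otherwise.
export_succeeded() {
  grep -qF 'Pipeline[ArtifactExportPipeline] execute successfully' "$1"
}

# Usage, with k.log being the nohup log file:
#   export_succeeded k.log && echo "export finished" || echo "not finished or failed"
```

The -F flag makes grep treat the square brackets literally instead of as a character class.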

5. Download the KubeSphere Core Helm Chart

Install helm.

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

A success message is shown when the installation completes

Download the KubeSphere Core Helm Chart.
```
VERSION=1.1.3     # Chart version
helm fetch https://charts.kubesphere.io/main/ks-core-${VERSION}.tgz
```
The file ks-core-1.1.3.tgz is downloaded to the current directory.
This is an example version; visit https://get-images.kubesphere.io or the KubeSphere GitHub repository for the latest chart version.

As of December 5, 2024, the latest version I see is 1.1.3

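Offline deployment

1. Preparation

Copy the following three files from the internet-connected host node1 to the master node in the offline environment.

kk
kubesphere.tar.gz
ks-core-1.1.3.tgz

I copied these three files to 172.16.21.30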

scp kk hadoop@172.16.21.30:/work1/kubesphere

scp kubesphere.tar.gz hadoop@172.16.21.30:/work1/kubesphere

scp ks-core-1.1.3.tgz hadoop@172.16.21.30:/work1/kubesphere
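
The package is large, so it may be worth confirming that the copies arrived intact. A minimal sketch using sha256sum; the file name SHA256SUMS and the two function names are my own choices, not something kk requires.

```shell
# On node1: write SHA-256 checksums of the given files into SHA256SUMS.
make_checksums() {
  sha256sum "$@" > SHA256SUMS
}

# On master: verify the files in the current directory against SHA256SUMS.
# Returns non-zero if any file is missing or differs.
verify_checksums() {
  sha256sum -c SHA256SUMS
}
```

On node1 this would be `make_checksums kk kubesphere.tar.gz ks-core-1.1.3.tgz`, followed by copying SHA256SUMS over with scp as well, and then `cd /work1/kubesphere && verify_checksums` on the master.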

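2. Create the configuration file

Create the configuration file for the offline cluster.

On the offline master node, in hadoop@172.16.21.30:/work1/kubesphere,
run the following command

./kk create config --with-kubernetes v1.26.12

Edit the configuration file.

vi config-sample.yaml

Note
Edit the node information to match the actual offline environment. Specify the node where the `registry` is deployed; KubeKey uses it to deploy a self-hosted Harbor registry. Under `registry` you can set `type` to `harbor`; otherwise a Docker registry is installed by default. For Kubernetes v1.24+, setting `containerManager` to `containerd` is recommended.

The modified content is as follows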


apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: sample
spec:
  hosts:
  - {name: master, address: 172.16.21.35, internalAddress: 172.16.21.35, user: docker, password: "*******"}
  roleGroups:
    etcd:
    - master
    control-plane: 
    - master
    worker:
    - master
    # Harbor needs to be installed here
    registry:
    - master
  controlPlaneEndpoint:
    ## Internal loadbalancer for apiservers 
    # internalLoadbalancer: haproxy

    domain: lb.kubesphere.local
    address: ""
    port: 6443
  kubernetes:
    version: v1.26.12
    clusterName: cluster.local
    autoRenewCerts: true
    containerManager: containerd
  etcd:
    type: kubekey
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
    ## multus support. https://github.com/k8snetworkplumbingwg/multus-cni
    multusCNI:
      enabled: false
  registry:
    # To deploy Harbor with kk, set this parameter to harbor
    type: harbor
    auths:
      "dockerhub.kubekey.local":
        # The Harbor account and password must be specified when deploying Harbor
        username: admin
        password: Harbor12345
        skipTLSVerify: true
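    # Private registry address used when deploying the cluster.
    privateRegistry: "dockerhub.kubekey.local"
    # The Kubernetes images in the offline package come from the Alibaba Cloud mirror registry, so this parameter must be set.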
    namespaceOverride: "kubesphereio"
    registryMirrors: []
    insecureRegistries: []
  addons: []

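3. Create the image registry

Run the following commands to create the image registry.

cd /work1/kubesphere
./kk init registry -f config-sample.yaml -a kubesphere.tar.gz


It failed. The machine is offline, so of course the download cannot work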


downloading amd64 harbor v2.10.1  ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: github.com
14:27:10 CST [WARN] Having a problem with accessing https://storage.googleapis.com? You can try again after setting environment 'export KKZONE=cn'
14:27:10 CST message: [LocalHost]
Failed to download harbor binary: curl -L -o /work1/kubesphere/kubekey/registry/harbor/v2.10.1/amd64/harbor-offline-installer-v2.10.1.tgz https://github.com/goharbor/harbor/releases/download/v2.10.1/harbor-offline-installer-v2.10.1.tgz error: exit status 6 
14:27:10 CST failed: [LocalHost]
error: Pipeline[InitRegistryPipeline] execute failed: Module[RegistryPackageModule] exec failed: 
failed: [LocalHost] [DownloadRegistryPackage] exec failed after 1 retries: Failed to download harbor binary: curl -L -o /work1/kubesphere/kubekey/registry/harbor/v2.10.1/amd64/harbor-offline-installer-v2.10.1.tgz https://github.com/goharbor/harbor/releases/download/v2.10.1/harbor-offline-installer-v2.10.1.tgz error: exit status 6 

So I downloaded the file on the internet-connected machine and copied it over

curl -L -o /home/kubesphere/kubekey/registry/harbor/v2.10.1/amd64/harbor-offline-installer-v2.10.1.tgz https://github.com/goharbor/harbor/releases/download/v2.10.1/harbor-offline-installer-v2.10.1.tgz

scp /home/kubesphere/kubekey/registry/harbor/v2.10.1/amd64/harbor-offline-installer-v2.10.1.tgz hadoop@172.16.21.30:/work1/kubesphere/kubekey/registry/harbor/v2.10.1/amd64/



Failed to download compose binary: curl -L -o /work1/kubesphere/kubekey/registry/compose/v2.26.1/amd64/docker-compose-linux-x86_64 https://github.com/docker/compose/releases/download/v2.26.1/docker-compose-linux-x86_64 error: exit status 6 
15:23:00 CST failed: [LocalHost]
error: Pipeline[InitRegistryPipeline] execute failed: Module[RegistryPackageModule] exec failed: 
failed: [LocalHost] [DownloadRegistryPackage] exec failed after 1 retries: Failed to download compose binary: curl -L -o /work1/kubesphere/kubekey/registry/compose/v2.26.1/amd64/docker-compose-linux-x86_64 https://github.com/docker/compose/releases/download/v2.26.1/docker-compose-linux-x86_64 error: exit status 6 




The fix is the same: download it manually and copy it over
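
Both error messages follow the same path layout under the kk working directory, so the destination for a manually downloaded file can be derived instead of copied out of the log. A small sketch; the layout is inferred only from the two error messages above, and kk_registry_path is a hypothetical helper name.

```shell
# Build the path where kk looked for a registry component, following the
# layout seen in the error messages:
#   WORKDIR/kubekey/registry/COMPONENT/VERSION/ARCH/FILE
kk_registry_path() {
  printf '%s/kubekey/registry/%s/%s/%s/%s\n' "$1" "$2" "$3" "$4" "$5"
}

# The two files that failed to download in this record.
kk_registry_path /work1/kubesphere harbor v2.10.1 amd64 harbor-offline-installer-v2.10.1.tgz
kk_registry_path /work1/kubesphere compose v2.26.1 amd64 docker-compose-linux-x86_64
```

If the target directory does not exist yet on the offline node, create it with `mkdir -p "$(dirname "$path")"` before running scp.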


Error
16:01:58 CST failed: [master]
error: Pipeline[CreateClusterPipeline] execute failed: Module[GreetingsModule] exec failed: 
failed: [master] execute task timeout, Timeout=30s


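So I added timeout to the host entry: - {name: master, address: 172.16.21.35, internalAddress: 172.16.21.35, user: docker, password: "*******", timeout: 1200}

Even with timeout added it still failed. A search suggested that the cause is the Chinese locale on Ubuntu

## vi /etc/default/locale, change the language, then reboot
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"

Downloading on the internet-connected machine was also very slow, so I used a proxy specifically for the download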

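config-sample.yaml is the configuration file for the offline cluster.
kubesphere.tar.gz is the offline installation package containing the images for ks-core and the extensions.

If the following output is shown, the image registry was created successfully.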

Local image registry created successfully. Address: dockerhub.kubekey.local

15:59:09 CST success: [master]
15:59:09 CST [ChownWorkerModule] Chown ./kubekey dir
15:59:09 CST success: [LocalHost]
15:59:09 CST Pipeline[InitRegistryPipeline] execute successfully

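The installation looked successful, but when I tried to access the registry, I could not.

Port 80 was unreachable through both the IP and the domain name. telnet 127.0.0.1 80 connected, but the host's external IP did not, so the traffic may have been blocked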

4. Create the Harbor projects (if the image registry is Harbor)

Note

Harbor projects are subject to role-based access control (RBAC): only users with certain roles can perform certain operations. If you have not created a project, images cannot be pushed to Harbor. Harbor has two types of projects:

Public: any user can pull images from the project.
Private: only users who are members of the project can pull images.

Harbor administrator account: admin, password: Harbor12345.

The Harbor installation files are in the /opt/harbor directory, where Harbor can be operated and maintained.

Run the following to create the Harbor projects. One thing to note: the domain in https://dockerhub.kubekey.local is a custom one. When Harbor is installed, several records are written to /etc/hosts, like this

# kubekey hosts BEGIN
172.16.21.35  master.cluster.local master
172.16.21.35  dockerhub.kubekey.local
172.16.21.35  lb.kubesphere.local
# kubekey hosts END

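Create the script file.

vi create_project_harbor.sh, write the following content, and then run it. I ran into assorted problems when writing the file, so pasting the lines into the console and running them there is easier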

#!/usr/bin/env bash

# Copyright 2018 The KubeSphere Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

url="https://dockerhub.kubekey.local"  # or change to the actual registry address
user="admin"
passwd="Harbor12345"

harbor_projects=(
        ks
        kubesphere
        kubesphereio
        coredns
        calico
        flannel
        cilium
        hybridnetdev
        kubeovn
        openebs
        library
        plndr
        jenkins
        argoproj
        dexidp
        openpolicyagent
        curlimages
        grafana
        kubeedge
        nginxinc
        prom
        kiwigrid
        minio
        opensearchproject
        istio
        jaegertracing
        timberio
        prometheus-operator
        jimmidyson
        elastic
        thanosio
        brancz
        prometheus
)

for project in "${harbor_projects[@]}"; do
    echo "creating $project"
    curl -u "${user}:${passwd}" -X POST -H "Content-Type: application/json" "${url}/api/v2.0/projects" -d "{ \"project_name\": \"${project}\", \"public\": true}" -k  # note the -k at the end of the curl command
done

Once created, the projects are visible on the Harbor page

5. Install Kubernetes

Run the following command to create the Kubernetes cluster:

./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-local-storage

It failed

etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://172.16.21.35:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.16.21.35:2379: connect: connection refused

error #0: dial tcp 172.16.21.35:2379: connect: connection refused: Process exited with status 1
11:46:09 CST retry: [master]
11:46:14 CST message: [master]
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://172.16.21.35:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.16.21.35:2379: connect: connection refused

error #0: dial tcp 172.16.21.35:2379: connect: connection refused: Process exited with status 1
11:46:14 CST retry: [master]
11:46:19 CST message: [master]
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://172.16.21.35:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.16.21.35:2379: connect: connection refused

error #0: dial tcp 172.16.21.35:2379: connect: connection refused: Process exited with status 1
11:46:19 CST retry: [master]
11:46:25 CST message: [master]
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://172.16.21.35:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.16.21.35:2379: connect: connection refused

error #0: dial tcp 172.16.21.35:2379: connect: connection refused: Process exited with status 1
11:46:25 CST retry: [master]

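Opening the ports as described below did not help either
Delete and reinstall
./kk delete cluster -f config-sample.yaml
./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-local-storage


It still failed. I found that installing Kubernetes wipes my existing iptables configuration, so I installed Harbor separately on another machine and then installed Kubernetes following the steps above, skipping the Harbor installation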

Opening the ports

# Open SSH
sudo iptables -A INPUT -p tcp --dport 22 -j ACCEPT

# Open etcd
sudo iptables -A INPUT -p tcp --dport 2379 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 2380 -j ACCEPT

# Open apiserver
sudo iptables -A INPUT -p tcp --dport 6443 -j ACCEPT

# Open calico
sudo iptables -A INPUT -p tcp --dport 9099 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 9100 -j ACCEPT

# Open BGP
sudo iptables -A INPUT -p tcp --dport 179 -j ACCEPT

# Open NodePort services
sudo iptables -A INPUT -p tcp --dport 30000:32767 -j ACCEPT

# Open master services
sudo iptables -A INPUT -p tcp --dport 10250 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 10258 -j ACCEPT

# Open DNS (TCP)
sudo iptables -A INPUT -p tcp --dport 53 -j ACCEPT

# Open DNS (UDP)
sudo iptables -A INPUT -p udp --dport 53 -j ACCEPT

# Open metrics-server
sudo iptables -A INPUT -p tcp --dport 8443 -j ACCEPT

# Open local-registry
sudo iptables -A INPUT -p tcp --dport 5000 -j ACCEPT

# Open local-apt
sudo iptables -A INPUT -p tcp --dport 5080 -j ACCEPT

# Open rpcbind
sudo iptables -A INPUT -p tcp --dport 111 -j ACCEPT


Check that the rules are in effect
sudo iptables -L -n -v



By default, iptables rules are lost after a reboot. To make them permanent, the rules must be saved.
sudo apt update
sudo apt install iptables-persistent

Save the rules

sudo netfilter-persistent save
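
Since the TCP rules all have the same shape, they can also be generated from a port list and reviewed before being applied. This is only a sketch that prints the commands without running them; tcp_accept_rules is a hypothetical helper name.

```shell
# Print (do not run) one ACCEPT rule per TCP port or port range given.
tcp_accept_rules() {
  for port in "$@"; do
    printf 'iptables -A INPUT -p tcp --dport %s -j ACCEPT\n' "$port"
  done
}

# A few of the ports from the list above.
tcp_accept_rules 22 2379 2380 6443 30000:32767
```

One thing to keep in mind is that -A appends to the end of the INPUT chain, so a rule added this way has no effect if an earlier rule in the chain already drops or rejects the traffic. `sudo iptables -L INPUT -n --line-numbers` shows the order.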

Record of related errors

1. ethtool

[WARNING FileExisting-ethtool]: ethtool not found in system path
error execution phase preflight: [preflight] Some fatal errors occurred:


Fix on Ubuntu 22.04:

sudo apt-get update
sudo apt-get install ethtool

2. container runtime is not running

 [ERROR CRI]: container runtime is not running: output: time="2024-12-10T13:46:06+08:00" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"


Fix:

Check the service status

systemctl status containerd

The status is normal: active (running)

sudo vi /etc/containerd/config.toml

Comment out disabled_plugins = ["cri"]

Then restart containerd

sudo systemctl restart containerd


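Cause analysis and references
Background on versions
According to the Kubernetes documentation, dockershim (the Docker integration built into the kubelet) was removed in v1.24.0, so a CRI-compatible runtime such as containerd must be installed.
containerd began as part of Docker and was later split out as a separate project, so it can be installed on its own.

About containerd
Common container runtimes include Docker, containerd, and CRI-O.
containerd implements the CRI (Container Runtime Interface) through its cri plugin; the kubelet calls containerd through this interface to create, run, and destroy containers.
containerd follows the OCI specification and uses runc to interact with the operating system kernel when creating and running containers.
Docker uses containerd as its runtime; Kubernetes uses containerd, CRI-O, and others.

Analysis of the error message
CRI: Container Runtime Interface
container runtime is not running: the container runtime is reported as not running
validate service connection: the check of the connection to the runtime service
validate CRI v1 runtime API for endpoint "unix:///run/containerd/containerd.sock" ... unknown service runtime.v1.RuntimeService: the socket answers, but containerd does not serve the CRI v1 runtime API on it. This matches containerd running normally with its cri plugin disabled, rather than a missing socket file.

containerd installed from a package disables CRI by default (key point)
containerd installed from a package ships with its role as a Kubernetes container runtime disabled by default.
Kubernetes then reports an error, because no container runtime is available to it.
To enable it, make sure the disabled_plugins list in /etc/containerd/config.toml does not contain cri.
The change takes effect only after containerd is restarted.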
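
The check and the edit can also be scripted. A minimal sketch, assuming the file contains a single-line `disabled_plugins = [...]` entry as in the default packaged config; both function names are made up for illustration.

```shell
# Return 0 if an uncommented disabled_plugins line in the file lists "cri".
cri_disabled() {
  grep -Eq '^[[:space:]]*disabled_plugins[[:space:]]*=.*"cri"' "$1"
}

# Comment out that line in place, keeping a .bak copy of the original file.
enable_cri() {
  sed -i.bak -E 's/^([[:space:]]*)(disabled_plugins[[:space:]]*=.*"cri".*)$/\1#\2/' "$1"
}

# Usage (as root):
#   cri_disabled /etc/containerd/config.toml && enable_cri /etc/containerd/config.toml
#   systemctl restart containerd
```

Commenting out the whole line also re-enables any other plugin listed on it, which is fine for the default file where cri is the only entry.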

3. Install ipvsadm: sudo apt install ipvsadm

4. Install chrony: sudo apt install chrony

After the steps above, etcd still failed to start, with the same error

14:25:27 CST message: [master]
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://172.16.21.35:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.16.21.35:2379: connect: connection refused



Check the status

 sudo systemctl status etcd

● etcd.service - etcd
     Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Tue 2024-12-10 14:55:26 CST; 2s ago
    Process: 6813 ExecStart=/usr/local/bin/etcd (code=exited, status=1/FAILURE)
   Main PID: 6813 (code=exited, status=1/FAILURE)

It is in a failed state

Start it again:  sudo systemctl start etcd


Check the error messages

journalctl -xe


Dec 10 15:07:03 master etcd[8160]: {"level":"fatal","ts":"2024-12-10T15:07:03.963267+0800","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:267"}


I noticed a problem:

telnet 172.16.21.35 2379 does not connect
telnet 127.0.0.1 2379 connects
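
A port that answers on 127.0.0.1 but not on the host IP usually means the process is listening on the loopback address only. One way to confirm this is to look at the listening sockets. The sketch below filters the output of `ss -ltn`; listen_addrs is a hypothetical helper name.

```shell
# Read `ss -ltn` output on stdin and print the local addresses that are
# listening on the given port.
listen_addrs() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")        # the port is the last ":"-separated part
      if (parts[n] == port) print $4
    }
  '
}

# Usage on the master node:
#   ss -ltn | listen_addrs 2379
```

If the only result is 127.0.0.1:2379, clients connecting to 172.16.21.35:2379 are refused even though the service is running.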


Checked the configuration file with  cat /etc/etcd.env

Found the problem and corrected the values
ETCDCTL_ENDPOINTS=https://172.16.21.35:2379  
ETCD_LISTEN_CLIENT_URLS=https://172.16.21.35:2379
ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379
ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380


After starting, it was still bound to 127, so I checked /etc/hosts and found this entry

127.0.1.1      master

Changed it to

172.16.21.35 master
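
An entry like this is easy to miss, so here is a small sketch that lists every IP a hosts file maps to a given name. The function name hosts_ips_for is made up; the file defaults to /etc/hosts.

```shell
# Print every IP address that a hosts-format file maps to the given name.
# $1 = hostname, $2 = hosts file (defaults to /etc/hosts).
hosts_ips_for() {
  awk -v name="$1" '
    { sub(/#.*/, "") }                                     # drop comments
    { for (i = 2; i <= NF; i++) if ($i == name) print $1 } # match any alias
  ' "${2:-/etc/hosts}"
}

# Usage:
#   hosts_ips_for master
```

More than one line of output, or a loopback address such as 127.0.1.1, means the name does not resolve only to the node's real IP.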


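What led me to this was that etcd started fine from the command line with  sudo /usr/local/bin/etcd  but failed when started as a service, because the environment variables configured for the service use the IP 172.16.21.35

Restart:  sudo systemctl start etcd
 
 sudo systemctl status etcd

etcd still would not start. Check the error log as root: journalctl -u etcd -xe > 1.log

The most recent useful log entries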

Dec 10 16:01:22 master etcd[18172]: {"level":"info","ts":"2024-12-10T16:01:22.959584+0800","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.13","git-sha":"c9063a0dc","go-version":"go1.21.8","go-os":"linux","go-arch":"amd64","max-cpu-set":48,"max-cpu-available":48,"member-initialized":false,"name":"etcd-master","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"250ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":10000,"max-wals":5,"max-snapshots":5,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://172.16.21.35:2380"],"listen-peer-urls":["https://172.16.21.35:2380"],"advertise-client-urls":["https://172.16.21.35:2379"],"listen-client-urls":["https://172.16.21.35:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"etcd-master=https://172.16.21.35:2380","initial-cluster-state":"existing","initial-cluster-token":"k8s_etcd","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"8h0m0s","auto-compaction-interval":"8h0m0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
Dec 10 16:01:22 master etcd[18172]: {"level":"warn","ts":"2024-12-10T16:01:22.959654+0800","caller":"fileutil/fileutil.go:53","msg":"check file permission","error":"directory \"/var/lib/etcd\" exist, but the permission is \"drwxr-xr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
Dec 10 16:01:22 master etcd[18172]: {"level":"info","ts":"2024-12-10T16:01:22.960848+0800","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"1.016372ms"}
Dec 10 16:01:22 master etcd[18172]: {"level":"info","ts":"2024-12-10T16:01:22.961965+0800","caller":"embed/etcd.go:375","msg":"closing etcd server","name":"etcd-master","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://172.16.21.35:2380"],"advertise-client-urls":["https://172.16.21.35:2379"]}
Dec 10 16:01:22 master etcd[18172]: {"level":"info","ts":"2024-12-10T16:01:22.962028+0800","caller":"embed/etcd.go:377","msg":"closed etcd server","name":"etcd-master","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://172.16.21.35:2380"],"advertise-client-urls":["https://172.16.21.35:2379"]}
Dec 10 16:01:22 master etcd[18172]: {"level":"fatal","ts":"2024-12-10T16:01:22.962054+0800","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:267"}
Dec 10 16:01:22 master systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited

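The log shows a warning that the permissions of the /var/lib/etcd directory do not match the recommendation. The recommended permission is 700 (drwx------), while the directory currently has 755 (drwxr-xr-x), which could allow unauthorized users to access the data.

Fix
Change the directory permissions. Set /var/lib/etcd to the recommended permissions:


chmod 700 /var/lib/etcd
Check the directory ownership. Make sure the owner and group of the directory are the etcd user:


chown -R etcd:etcd /var/lib/etcd
Restart the service. After the changes, start the etcd service again:


systemctl start etcd

Switch to bootstrapping a new single-node cluster (the log above shows "initial-cluster-state":"existing", followed by the fatal "discovery failed" error)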

ETCD_INITIAL_CLUSTER_STATE=new
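
The record does not say which file was edited. I am assuming here that it is /etc/etcd.env, the file inspected earlier. Under that assumption, a sketch for setting a variable in such a file without opening an editor (set_env_var is a hypothetical helper name):

```shell
# Set KEY=VALUE in an env file: replace the existing line if the key is
# present, append a new line otherwise. A .bak copy is kept on replace.
# Note: the value must not contain "|" or "&", which sed would interpret.
set_env_var() {
  file=$1 key=$2 value=$3
  if grep -q "^${key}=" "$file"; then
    sed -i.bak "s|^${key}=.*|${key}=${value}|" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Usage (as root, file path is an assumption):
#   set_env_var /etc/etcd.env ETCD_INITIAL_CLUSTER_STATE new
```

With the state set to existing, etcd tries to join a cluster that is already running, which cannot work for the first and only member; new tells it to bootstrap one.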


Delete the old data
rm -rf /var/lib/etcd/*

Start it: sudo systemctl start etcd


It started successfully

Then run the installation again

./kk delete cluster -f config-sample.yaml
./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-local-storage

A certificate problem was reported

 [WARNING ImagePull]: failed to pull image dockerhub.kubekey.local/kubesphereio/coredns:1.9.3: output: E1210 16:48:21.678224   36794 remote_image.go:180] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"dockerhub.kubekey.local/kubesphereio/coredns:1.9.3\": failed to resolve reference \"dockerhub.kubekey.local/kubesphereio/coredns:1.9.3\": failed to do request: Head \"https://dockerhub.kubekey.local/v2/kubesphereio/coredns/manifests/1.9.3\": tls: failed to verify certificate: x509: certificate signed by unknown authority" image="dockerhub.kubekey.local/kubesphereio/coredns:1.9.3"
time="2024-12-10T16:48:21+08:00" level=fatal msg="pulling image: failed to pull and unpack image \"dockerhub.kubekey.local/kubesphereio/coredns:1.9.3\": failed to resolve reference \"dockerhub.kubekey.local/kubesphereio/coredns:1.9.3\": failed to do request: Head \"https://dockerhub.kubekey.local/v2/kubesphereio/coredns/manifests/1.9.3\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
, error: exit status 1

This is only a warning and can be ignored

sudo mkdir -p /etc/docker/certs.d/dockerhub.kubekey.local
I found that the directory /etc/docker/certs.d/dockerhub.kubekey.local already contains the Harbor certificate

Error

error: Pipeline[CreateClusterPipeline] execute failed: Module[KubernetesStatusModule] exec failed: 
failed: [master] [GetClusterStatus] exec failed after 3 retries: get kubernetes cluster info failed: Failed to exec command: sudo -E /bin/bash -c "/usr/local/bin/kubectl --no-headers=true get nodes -o custom-columns=:metadata.name,:status.nodeInfo.kubeletVersion,:status.addresses" 
E1211 10:19:34.703393   25975 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 10:19:34.704539   25975 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 10:19:34.705431   25975 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 10:19:34.707890   25975 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 10:19:34.708539   25975 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused

Switch to the root account

The main problem shows up in the error log from journalctl -xeu kubelet

Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289043   48502 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289224   48502 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-scheduler-master"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289289   48502 kuberuntime_manager.go:782] "CreatePodSandbox for pod failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-scheduler-master"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289468   48502 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-scheduler-master_kube-system(4ca7fb2db07d0f724baa8308d590dcb6)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-scheduler-master_kube-system(4ca7fb2db07d0f724baa8308d590dcb6)\\\": rpc error: code = DeadlineExceeded desc = failed to get sandbox image \\\"registry.k8s.io/pause:3.8\\\": failed to pull image \\\"registry.k8s.io/pause:3.8\\\": failed to pull and unpack image \\\"registry.k8s.io/pause:3.8\\\": failed to resolve reference \\\"registry.k8s.io/pause:3.8\\\": failed to do request: Head \\\"https://registry.k8s.io/v2/pause/manifests/3.8\\\": dial tcp 34.96.108.209:443: i/o timeout\"" pod="kube-system/kube-scheduler-master" podUID=4ca7fb2db07d0f724baa8308d590dcb6
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289619   48502 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289697   48502 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-controller-manager-master"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289740   48502 kuberuntime_manager.go:782] "CreatePodSandbox for pod failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-controller-manager-master"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289833   48502 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-controller-manager-master_kube-system(f4e475d9dffaba24cb459a418e20d79b)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-controller-manager-master_kube-system(f4e475d9dffaba24cb459a418e20d79b)\\\": rpc error: code = DeadlineExceeded desc = failed to get sandbox image \\\"registry.k8s.io/pause:3.8\\\": failed to pull image \\\"registry.k8s.io/pause:3.8\\\": failed to pull and unpack image \\\"registry.k8s.io/pause:3.8\\\": failed to resolve reference \\\"registry.k8s.io/pause:3.8\\\": failed to do request: Head \\\"https://registry.k8s.io/v2/pause/manifests/3.8\\\": dial tcp 34.96.108.209:443: i/o timeout\"" pod="kube-system/kube-controller-manager-master" podUID=f4e475d9dffaba24cb459a418e20d79b

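The main cause of this error is that the image registry.k8s.io/pause:3.8 cannot be pulled, which shows that the offline deployment documentation is incomplete

Pull it manually: docker pull kubesphere/pause:3.8

Then tag it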

docker tag kubesphere/pause:3.8 registry.k8s.io/pause:3.8
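
One thing worth checking at this point: the cluster config sets containerManager: containerd, so the kubelet asks containerd for the pause image, and an image pulled and tagged with docker sits in Docker's own image store, where containerd's CRI plugin does not look. The image name in the kubelet log most likely comes from the sandbox_image setting of containerd (or its built-in default when the setting is absent). This is an assumption on my part, not something this record confirms. The sketch below only reads that setting; sandbox_image_of is a hypothetical helper name.

```shell
# Print the sandbox_image value from a containerd config file, if one is set.
sandbox_image_of() {
  sed -n 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"\(.*\)".*$/\1/p' "$1"
}

# Usage:
#   sandbox_image_of /etc/containerd/config.toml
```

If it prints registry.k8s.io/pause:3.8, or nothing at all, one option is to point sandbox_image at a copy of the pause image in the private registry (for example under dockerhub.kubekey.local, provided the image was pushed there) and restart containerd.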

Run the installation

./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-local-storage

It still failed

get kubernetes cluster info failed: Failed to exec command: sudo -E /bin/bash -c "/usr/local/bin/kubectl --no-headers=true get nodes -o custom-columns=:metadata.name,:status.nodeInfo.kubeletVersion,:status.addresses" 
E1211 11:10:34.727469   27176 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 11:10:34.728832   27176 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 11:10:34.729168   27176 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 11:10:34.730709   27176 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 11:10:34.731018   27176 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?: Process exited with status 1
11:10:34 CST failed: [master]

Check the kubelet logs with sudo journalctl -xeu kubelet:

Dec 11 11:14:45 master kubelet[48502]: E1211 11:14:45.286340   48502 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout"
Dec 11 11:14:45 master kubelet[48502]: E1211 11:14:45.286463   48502 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-scheduler-master"
Dec 11 11:14:45 master kubelet[48502]: E1211 11:14:45.286493   48502 kuberuntime_manager.go:782] "CreatePodSandbox for pod failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-scheduler-master"
Dec 11 11:14:45 master kubelet[48502]: E1211 11:14:45.286613   48502 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-scheduler-master_kube-system(4ca7fb2db07d0f724baa8308d590dcb6)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-scheduler-master_kube-system(4ca7fb2db07d0f724baa8308d590dcb6)\\\": rpc error: code = DeadlineExceeded desc = failed to get sandbox image \\\"registry.k8s.io/pause:3.8\\\": failed to pull image \\\"registry.k8s.io/pause:3.8\\\": failed to pull and unpack image \\\"registry.k8s.io/pause:3.8\\\": failed to resolve reference \\\"registry.k8s.io/pause:3.8\\\": failed to do request: Head \\\"https://registry.k8s.io/v2/pause/manifests/3.8\\\": dial tcp 34.96.108.209:443: i/o timeout\"" pod="kube-system/kube-scheduler-master" podUID=4ca7fb2db07d0f724baa8308d590dcb6
Dec 11 11:14:45 master kub

Checking the kubelet status shows that it is running:

systemctl status kubelet

Running kubectl get pod --all-namespaces fails:

E1211 11:47:34.401922 27583 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
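A note on reading this message: kubectl falls back to http://localhost:8080 when it cannot find a kubeconfig, so this error mainly says that no kubeconfig was loaded; it does not by itself prove anything about the API server. The sketch below separates the two cases. classify_kubectl_failure is a hypothetical helper, and the two labels it prints are my own naming.

```shell
# Hypothetical helper: read a kubectl/kubelet error message on stdin and
# print a one-word diagnosis.
classify_kubectl_failure() {
  msg="$(cat)"
  case "$msg" in
    # kubectl default endpoint: no kubeconfig was found
    *localhost:8080*) echo "no-kubeconfig" ;;
    # a real API server address was dialed but nothing is listening
    *:6443*"connection refused"*) echo "apiserver-down" ;;
    *) echo "unknown" ;;
  esac
}

# Usage:
#   kubectl get nodes 2>&1 | classify_kubectl_failure
```

In this record the kubelet logs further down show connection refused on port 6443, which matches the second case: the API server itself never came up.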

After restarting, check the errors printed to the console:

The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10]

Fix:

sudo vi /var/lib/kubelet/config.yaml 

Change

clusterDNS:
- 169.254.25.10

to

clusterDNS:
- 10.233.0.10
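The same edit can be scripted. This is a minimal sketch that assumes GNU sed (as shipped with CentOS and Rocky Linux) and that the clusterDNS list has a single entry on the line right after the key; set_cluster_dns is a hypothetical helper name. One caveat, which is my assumption and was not verified in this record: the message is worded as a recommendation, and in KubeKey deployments 169.254.25.10 is usually the NodeLocal DNS cache address, so this change may not be what unblocks the install.

```shell
# Hypothetical helper: replace the first clusterDNS entry in a kubelet
# config file, keeping a backup next to it.
set_cluster_dns() {
  file="$1"
  new_ip="$2"
  cp "$file" "$file.bak"
  # On the line after "clusterDNS:", rewrite the list item.
  sed -i "/^clusterDNS:/{n;s/^- .*/- $new_ip/;}" "$file"
}

# Usage on the node:
#   sudo sh -c '. ./helpers.sh; set_cluster_dns /var/lib/kubelet/config.yaml 10.233.0.10'
#   sudo systemctl restart kubelet
```

The file name helpers.sh in the usage comment is just a placeholder for wherever the function is saved.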

Next, fix the error "command failed" err="failed to validate kubelet flags: the container runtime endpoint address was not specified or empty, use --container-runtime-endpoint to set


sudo vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

ExecStart=/usr/local/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS --container-runtime-endpoint=unix:///run/containerd/containerd.sock
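The flag has to sit on the same line as ExecStart (a wrapped line, as produced by a narrow terminal, would break it), and systemd only picks up the edit after a reload. The following sketch checks the drop-in before restarting; has_runtime_endpoint is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the unit drop-in has an ExecStart
# line that carries a non-empty --container-runtime-endpoint value.
has_runtime_endpoint() {
  grep -Eq '^ExecStart=.*--container-runtime-endpoint=[^[:space:]]+' "$1"
}

# Usage on the node:
#   has_runtime_endpoint /etc/systemd/system/kubelet.service.d/10-kubeadm.conf \
#     && sudo systemctl daemon-reload \
#     && sudo systemctl restart kubelet
```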

What is certain at this point is that the Kubernetes API server has not started, and I cannot find any logs for it.

Check the YAML:

yamllint /etc/kubernetes/manifests/kube-apiserver.yaml

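It reports errors.

I reformatted the file with the VS Code YAML extension, although I suspected that this would not solve the problem.

Indeed, it had no effect. Following the log with journalctl -u kubelet -f revealed this error: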

Dec 11 15:40:29 master kubelet[38099]: E1211 15:40:29.006151 38099 file.go:187] "Could not process manifest file" err="/etc/kubernetes/manifests/ystemctl status docker: couldn't parse as pod(yaml: control characters are not allowed), please check config file" path="/etc/kubernetes/manifests/ystemctl status docker"

ps aux | grep kubelet
root 38099 3.1 0.1 4755728 109976 ? Ssl 14:33 2:12 /usr/local/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --container-runtime-endpoint=unix:///run/containerd/containerd.sock --pod-infra-container-image=dockerhub.kubekey.local/kubesphereio/pause:3.9 --node-ip=172.16.21.35 --hostname-override=master

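This points to a stray file in /etc/kubernetes/manifests named "ystemctl status docker", presumably left behind by a mistyped command. I deleted it, but it made no difference.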
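The kubelet tries to parse every file in the static Pod directory (files whose names start with a dot are skipped), so anything that is not a manifest produces the "Could not process manifest file" error above. A minimal sketch for spotting such files follows; list_stray_manifests is a hypothetical helper, and treating only .yaml, .yml and .json as legitimate is my assumption.

```shell
# Hypothetical helper: list files in the static Pod directory that do not
# look like manifests.
list_stray_manifests() {
  dir="${1:-/etc/kubernetes/manifests}"
  find "$dir" -maxdepth 1 -type f \
    ! -name '*.yaml' ! -name '*.yml' ! -name '*.json' ! -name '.*'
}

# Usage on the node:
#   sudo sh -c '. ./helpers.sh; list_stray_manifests'
```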

The rolling error log from journalctl -u kubelet -f:

Dec 11 15:48:26 master kubelet[38099]: E1211 15:48:26.638739   38099 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-controller-manager-master_kube-system(f4e475d9dffaba24cb459a418e20d79b)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-controller-manager-master_kube-system(f4e475d9dffaba24cb459a418e20d79b)\\\": rpc error: code = Unknown desc = failed to get sandbox image \\\"registry.k8s.io/pause:3.8\\\": failed to pull image \\\"registry.k8s.io/pause:3.8\\\": failed to pull and unpack image \\\"registry.k8s.io/pause:3.8\\\": failed to resolve reference \\\"registry.k8s.io/pause:3.8\\\": failed to do request: Head \\\"https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.8\\\": dial tcp 74.125.199.82:443: i/o timeout\"" pod="kube-system/kube-controller-manager-master" podUID=f4e475d9dffaba24cb459a418e20d79b
Dec 11 15:48:29 master kubelet[38099]: I1211 15:48:29.073478   38099 status_manager.go:698] "Failed to get status for pod" podUID=9eb830c8cce30bfcab1dc46488c4c23e pod="kube-system/kube-apiserver-master" err="Get \"https://lb.kubesphere.local:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-master\": dial tcp 172.16.21.35:6443: connect: connection refused"
Dec 11 15:48:29 master kubelet[38099]: E1211 15:48:29.640909   38099 eviction_manager.go:261] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"master\" not found"
Dec 11 15:48:30 master kubelet[38099]: E1211 15:48:30.173752   38099 controller.go:146] failed to ensure lease exists, will retry in 7s, error: Get "https://lb.kubesphere.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master?timeout=10s": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:32 master kubelet[38099]: I1211 15:48:32.253793   38099 kubelet_node_status.go:70] "Attempting to register node" node="master"
Dec 11 15:48:32 master kubelet[38099]: E1211 15:48:32.255001   38099 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://lb.kubesphere.local:6443/api/v1/nodes\": dial tcp 172.16.21.35:6443: connect: connection refused" node="master"
Dec 11 15:48:33 master kubelet[38099]: E1211 15:48:33.140053   38099 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-controller-manager-master.18100bd8e22b2b4b", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"kube-controller-manager-master", UID:"f4e475d9dffaba24cb459a418e20d79b", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"FailedCreatePodSandBox", Message:"Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.8\": dial tcp 74.125.195.82:443: i/o timeout", Source:v1.EventSource{Component:"kubelet", Host:"master"}, FirstTimestamp:time.Date(2024, time.December, 11, 14, 34, 42, 672962379, time.Local), LastTimestamp:time.Date(2024, time.December, 11, 14, 34, 42, 672962379, time.Local), Count:1, Type:"Warning", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"kubelet", ReportingInstance:"master"}': 'Post "https://lb.kubesphere.local:6443/api/v1/namespaces/kube-system/events": dial tcp 172.16.21.35:6443: connect: connection refused'(may retry after sleeping)
Dec 11 15:48:37 master kubelet[38099]: E1211 15:48:37.175829   38099 controller.go:146] failed to ensure lease exists, will retry in 7s, error: Get "https://lb.kubesphere.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master?timeout=10s": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:39 master kubelet[38099]: I1211 15:48:39.073764   38099 status_manager.go:698] "Failed to get status for pod" podUID=9eb830c8cce30bfcab1dc46488c4c23e pod="kube-system/kube-apiserver-master" err="Get \"https://lb.kubesphere.local:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-master\": dial tcp 172.16.21.35:6443: connect: connection refused"
Dec 11 15:48:39 master kubelet[38099]: I1211 15:48:39.258619   38099 kubelet_node_status.go:70] "Attempting to register node" node="master"
Dec 11 15:48:39 master kubelet[38099]: E1211 15:48:39.259595   38099 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://lb.kubesphere.local:6443/api/v1/nodes\": dial tcp 172.16.21.35:6443: connect: connection refused" node="master"
Dec 11 15:48:39 master kubelet[38099]: E1211 15:48:39.641239   38099 eviction_manager.go:261] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"master\" not found"
Dec 11 15:48:40 master kubelet[38099]: E1211 15:48:40.183260   38099 certificate_manager.go:471] kubernetes.io/kube-apiserver-client-kubelet: Failed while requesting a signed certificate from the control plane: cannot create certificate signing request: Post "https://lb.kubesphere.local:6443/apis/certificates.k8s.io/v1/certificatesigningrequests": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:41 master kubelet[38099]: W1211 15:48:41.010588   38099 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Node: Get "https://lb.kubesphere.local:6443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster&limit=500&resourceVersion=0": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:41 master kubelet[38099]: E1211 15:48:41.010680   38099 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://lb.kubesphere.local:6443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster&limit=500&resourceVersion=0": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:42 master kubelet[38099]: E1211 15:48:42.430816   38099 file.go:108] "Unable to process watch event" err="the pod with key kube-system/kube-apiserver-master doesn't exist in cache"
Dec 11 15:48:43 master kubelet[38099]: E1211 15:48:43.141377   38099 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-controller-manager-master.18100bd8e22b2b4b", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"kube-controller-manager-master", UID:"f4e475d9dffaba24cb459a418e20d79b", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"FailedCreatePodSandBox", Message:"Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.8\": dial tcp 74.125.195.82:443: i/o timeout", Source:v1.EventSource{Component:"kubelet", Host:"master"}, FirstTimestamp:time.Date(2024, time.December, 11, 14, 34, 42, 672962379, time.Local), LastTimestamp:time.Date(2024, time.December, 11, 14, 34, 42, 672962379, time.Local), Count:1, Type:"Warning", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"kubelet", ReportingInstance:"master"}': 'Post "https://lb.kubesphere.local:6443/api/v1/namespaces/kube-system/events": dial tcp 172.16.21.35:6443: connect: connection refused'(may retry after sleeping)
Dec 11 15:48:44 master kubelet[38099]: E1211 15:48:44.178978   38099 controller.go:146] failed to ensure lease exists, will retry in 7s, error: Get "https://lb.kubesphere.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master?timeout=10s": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:46 master kubelet[38099]: I1211 15:48:46.263268   38099 kubelet_node_status.go:70] "Attempting to register node" node="master"
Dec 11 15:48:46 master kubelet[38099]: E1211 15:48:46.264191   38099 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://lb.kubesphere.local:6443/api/v1/nodes\": dial tcp 172.16.21.35:6443: connect: connection refused" node="master"
Dec 11 15:48:46 master kubelet[38099]: W1211 15:48:46.972132   38099 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.CSIDriver: Get "https://lb.kubesphere.local:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:46 master kubelet[38099]: E1211 15:48:46.972238   38099 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: Get "https://lb.kubesphere.local:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:49 master kubelet[38099]: I1211 15:48:49.007214   38099 topology_manager.go:210] "Topology Admit Handler" podUID=9eb830c8cce30bfcab1dc46488c4c23e podNamespace="kube-system" podName="kube-apiserver-master"
Dec 11 15:48:49 master kubelet[38099]: E1211 15:48:49.642259   38099 eviction_manager.go:261] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"master\" not found"
Dec 11 15:48:51 master kubelet[38099]: E1211 15:48:51.180776   38099 controller.go:146] failed to ensure lease exists, will retry in 7s, error: Get "https://lb.kubesphere.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master?timeout=10s": dial tcp 172.16.21.35:6443: connect: connection refused

First, fix the failure to pull the pause image.

containerd config default > /etc/containerd/config.toml


sudo vi /etc/containerd/config.toml

[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
    endpoint = ["https://xxxxxxxxx.mirror.swr.myhuaweicloud.com"]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
    endpoint = ["https://xxxxxxxxx.mirror.swr.myhuaweicloud.com"]

Replace https://xxxxxxxxx.mirror.swr.myhuaweicloud.com with your own mirror address. For reference, see https://support.huaweicloud.com/usermanual-swr/swr_01_0045.html


systemctl restart containerd


This had no effect.
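A possible explanation, which I did not verify: the ps output earlier shows the kubelet started with --pod-infra-container-image=dockerhub.kubekey.local/kubesphereio/pause:3.9, yet the failing pull is registry.k8s.io/pause:3.8. With containerd the sandbox image is chosen by the sandbox_image setting of the CRI plugin in /etc/containerd/config.toml, and containerd config default writes the upstream default there. The sketch below just prints the current value so that it can be compared with the image that actually exists in the offline registry; sandbox_image as a function name is a hypothetical helper.

```shell
# Hypothetical helper: print the sandbox (pause) image that containerd's
# CRI plugin is configured to use.
sandbox_image() {
  file="${1:-/etc/containerd/config.toml}"
  sed -nE 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"([^"]*)".*/\1/p' "$file"
}

# Usage on the node:
#   sandbox_image
#   # If it prints an image the offline node cannot reach, edit the
#   # sandbox_image line and run: sudo systemctl restart containerd
```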

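Import the image from Docker into containerd:

docker save registry.k8s.io/pause:3.8 -o pause.tar  # save the image from Docker to a tarball

ctr -n k8s.io images import pause.tar  # import the tarball into containerd's k8s.io namespace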
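Docker and containerd keep separate image stores, which is why the earlier docker tag alone did not help: the kubelet here talks to containerd and only sees images in the k8s.io namespace. The two commands above can be wrapped for reuse with other images. This is a sketch; import_into_containerd and the DRY_RUN switch are my own hypothetical additions.

```shell
# Hypothetical helper: copy one image from Docker's store into
# containerd's k8s.io namespace. With DRY_RUN set to a non-empty value
# the commands are only printed.
import_into_containerd() {
  image="$1"
  # Build a file name from the image reference, e.g. a/b:1 -> a_b_1.tar
  tarball="$(printf '%s' "$image" | tr '/:' '__').tar"
  run="${DRY_RUN:+echo}"
  $run docker save "$image" -o "$tarball"
  $run ctr -n k8s.io images import "$tarball"
}

# Usage:
#   DRY_RUN=1 import_into_containerd registry.k8s.io/pause:3.8   # preview
#   import_into_containerd registry.k8s.io/pause:3.8             # execute
```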

Run the installation:

./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-local-storage

A new error appears:

Dec 11 16:48:36 master kubelet[43735]: E1211 16:48:36.464582   43735 remote_image.go:171] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12\": failed to resolve reference \"dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12\": failed to do request: Head \"https://dockerhub.kubekey.local/v2/kubesphereio/kube-scheduler/manifests/v1.26.12\": tls: failed to verify certificate: x509: certificate signed by unknown authority" image="dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12"

Dec 11 16:48:36 master kubelet[43735]: E1211 16:48:36.464615   43735 kuberuntime_image.go:53] "Failed to pull image" err="rpc error: code = Unknown desc = failed to pull and unpack image \"dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12\": failed to resolve reference \"dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12\": failed to do request: Head \"https://dockerhub.kubekey.local/v2/kubesphereio/kube-scheduler/manifests/v1.26.12\": tls: failed to verify certificate: x509: certificate signed by unknown authority" image="dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12"
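The x509 message means containerd does not trust the certificate presented by dockerhub.kubekey.local. One common approach is to point containerd's CRI registry configuration at the registry's CA file. The sketch below only prints the TOML stanza for review rather than editing the file. Assumptions: a containerd 1.x config that uses the registry.configs section, a hypothetical helper name registry_tls_snippet, and a placeholder CA path that must be replaced with wherever the registry CA really lives.

```shell
# Hypothetical helper: print a containerd CRI stanza that makes the given
# registry host trusted via a CA file.
registry_tls_snippet() {
  host="$1"
  ca_file="$2"
  cat <<EOF
[plugins."io.containerd.grpc.v1.cri".registry.configs."$host".tls]
  ca_file = "$ca_file"
EOF
}

# Usage (the CA path is a placeholder, not a known location):
#   registry_tls_snippet dockerhub.kubekey.local /path/to/registry-ca.crt
#   # Review the output, add it to /etc/containerd/config.toml, then:
#   #   sudo systemctl restart containerd
```

Pulling with docker does not change this, because the kubelet pulls through containerd and not through Docker.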


Pull the images manually:

docker pull dockerhub.kubekey.local/kubesphereio/kube-apiserver:v1.26.12
docker pull dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12


The error then changes to:

Dec 11 16:55:45 master kubelet[43735]: E1211 16:55:45.453268   43735 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with ImagePullBackOff: \"Back-off pulling image \\\"dockerhub.kubekey.local/kubesphereio/kube-apiserver:v1.26.12\\\"\"" pod="kube-system/kube-apiserver-master" podUID=9eb830c8cce30bfcab1dc46488c4c23e


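Is this error caused by Harbor being installed on a different machine, so that the network is unreachable from inside the container?
How should this be handled?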


One suggestion is to start the Pod with the host network:

spec:
  hostNetwork: true

That did not work either, so I decided to try the online installation instead:

export KKZONE=cn
./kk create cluster --with-kubernetes v1.22.12 --with-kubesphere v3.4.1


Sure enough, this went much better than the offline installation; at least the Docker images were all created.


Console: http://172.16.21.35:30880
Account: admin
Password: P@88w0rd

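Summary: in my experience the offline installation is too unreliable. It runs into many problems that are hard to solve, and the learning cost is too high.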