Offline KubeSphere Installation: A Hands-On Record (Installation Failed, Problems of All Kinds)

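This record follows the tutorial on the official website and documents what happened in my actual server environment.

Official offline installation tutorial for 4.1: Offline Installation of KubeSphere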

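Prepare at least three hosts, following the example below.

Host IP        Hostname   Role
172.16.20.21   node1      Internet-connected host, used to build the offline package
172.16.21.30   master     Master node of the offline environment
172.16.20.20   node2      Image registry node of the offline environment (can be skipped if you already have a registry)

In my case, this server is a Docker image registry that I deployed with Nexus.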

socat and conntrack must be installed on master and node2.

My servers run CentOS 7.9 and Rocky Linux release 8.10 (Green Obsidian), so I installed them with yum instead.

yum install socat conntrack -y
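Before moving on, it can help to confirm that both tools really ended up on the PATH. The following is only a sketch: missing_tools is a name I made up, and it does nothing more than look the commands up.

```shell
#!/bin/sh
# missing_tools is a hypothetical helper, not part of KubeKey.
# It prints every command name from its arguments that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: empty output means both tools are installed.
missing_tools socat conntrack
```

Run it on master and node2; anything it prints still needs to be installed.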

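Get the version information and image list

  1. Visit KubeSphere Images.

  2. Select the extensions you want to deploy. I selected all of them.

  3. Enter your email address.

  4. Click to get the image list.

  5. Check the mailbox you entered for the latest KubeSphere version information and the image list files.

After that, open the email in your mailbox and download the attachments.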

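File name                 Description
kubesphere-images.txt     Lists all images used by KubeSphere and its extensions, together with their addresses on Huawei Cloud. Use this list to sync the images to the offline registry.
kk-manifest.yaml          Lists all images used by KubeSphere and its extensions. kk can use it to build the offline package quickly.
kk-manifest-mirror.yaml   Lists all images used by KubeSphere and its extensions as hosted in the Huawei Cloud image registry. Use this manifest to build the offline package when access to DockerHub is restricted.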

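Build the offline installation package

Log in to node1, the host with internet access, and follow the steps below to build the KubeSphere offline installation package.

1. Install KubeKey

Run the following command to install the KubeKey tool.

When the download completes, the KubeKey binary kk is generated in the current directory.

ssh 172.16.20.21

Create a dedicated directory first: mkdir /home/kubesphere

Change into it: cd /home/kubesphere

Then run the following command: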

curl -sSL https://get-kk.kubesphere.io | sh -

After several attempts, kk finally downloaded successfully.

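2. Create the manifest file

The tutorial's description at this point is hard to follow. It is not clear what "only use kk to package the KubeSphere images for the offline environment" means, or what "use kk to deploy Kubernetes and the image registry" means. My goal is simply an offline deployment, so I set that aside, followed the steps, and left the summary for later.

Run:

export KKZONE=cn


# To deploy the image registry offline with kk, add --with-registry to package the registry installation files
./kk create manifest --with-kubernetes v1.26.12 --with-registry

# I want to use my own Nexus-hosted Docker registry, so I leave out --with-registry

The command I actually ran:

./kk create manifest --with-kubernetes v1.26.12

This command creates a manifest-sample.yaml file.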

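3. Edit the manifest file

If you need kk to deploy Kubernetes and the image registry, just add the KubeSphere image list from the email to the newly created manifest file.

I found that sentence puzzling. What does "if you need kk to deploy Kubernetes and the image registry" mean? Of course I need that.

The tutorial does not say separately what to do when the image registry is a local one, so I kept following it.

  • Open the manifest file.
vi manifest-sample.yaml
  • Copy the image list from kk-manifest.yaml, or from kk-manifest-mirror.yaml if access to DockerHub is restricted, into the newly created manifest-sample.yaml.

In other words, which file you copy from depends on your network. I am in mainland China, so I copied from kk-manifest-mirror.yaml into manifest-sample.yaml. Note the wording "image list" above (red and bold in the original): copy only the image list and nothing else, and paste it under spec.images in manifest-sample.yaml, after the images already listed there. Why kk-manifest-mirror.yaml? Judging by the names, one file is the original and the other is the mirror, and the addresses in kk-manifest-mirror.yaml point to the Huawei Cloud registry swr.cn-southwest-2.myhuaweicloud.com.

For the modified content, see the attached resource: manifest-sample.yaml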
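Copying the list by hand is error-prone, so here is a sketch of pulling only the image entries out of the mirror manifest. It assumes the file contains a line "  images:" (indented by two spaces under spec) followed by list items that start with "  - ". That layout is my assumption, not something the tutorial states, and extract_images is a name I made up.

```shell
#!/bin/sh
# extract_images is a hypothetical helper. Assumption: the manifest has a line
# "  images:" followed by list items that start with "  - ".
extract_images() {
    awk '
        /^  images:[[:space:]]*$/ { in_list = 1; next }   # start of the list
        in_list && /^  - /        { print; next }          # one image entry
        in_list                   { in_list = 0 }          # any other line ends the list
    ' "$1"
}

# Usage (not executed here): save the entries, then paste them under
# spec.images in manifest-sample.yaml.
# extract_images kk-manifest-mirror.yaml > images-to-add.txt
```

Reviewing the saved entries before pasting keeps everything else in the mirror manifest out of manifest-sample.yaml.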

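4. Build the offline package

Run the following command to build the offline installation package containing the images of ks-core and the extensions.

./kk artifact export -m manifest-sample.yaml -o kubesphere.tar.gz

I recommend running it in the background, because it takes a long time and the SSH session can easily drop.

Set the variable first: export KKZONE=cn

nohup ./kk artifact export -m manifest-sample.yaml -o kubesphere.tar.gz > k.log 2>&1 &

When it failed, the cause was mostly a timeout while downloading from GitHub. Re-running the command by hand a few times got it through.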
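Since the fix was simply to run the command again, the re-runs can be wrapped in a small loop. This is a sketch with a helper name of my own (retry). I have not verified whether kk reuses files it already downloaded on a re-run.

```shell
#!/bin/sh
# retry is a hypothetical helper: run a command until it succeeds, at most N times.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    max=$1; delay=$2; shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (not executed here):
# export KKZONE=cn
# retry 5 30 ./kk artifact export -m manifest-sample.yaml -o kubesphere.tar.gz
```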

On success, the following is displayed:

images/index.json
images/oci-layout
kube/v1.26.12/amd64/kubeadm
kube/v1.26.12/amd64/kubectl
kube/v1.26.12/amd64/kubelet
runc/v1.1.12/amd64/runc.amd64
09:55:41 CST success: [LocalHost]
09:55:41 CST [ChownOutputModule] Chown output file
09:55:41 CST success: [LocalHost]
09:55:41 CST [ChownWorkerModule] Chown ./kubekey dir
09:55:41 CST success: [LocalHost]
09:55:41 CST Pipeline[ArtifactExportPipeline] execute successfully

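5. Download the KubeSphere Core Helm Chart

  1. Install helm.

    curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

    A success message is printed when the installation finishes.

  2. Download the KubeSphere Core Helm Chart.

    VERSION=1.1.3     # Chart version
    helm fetch https://charts.kubesphere.io/main/ks-core-${VERSION}.tgz


    The file ks-core-1.1.3.tgz is downloaded to the current directory.

    This is an example version. Visit https://get-images.kubesphere.io or the KubeSphere GitHub repository to find the latest chart version.

As of December 5, 2024, the latest version I saw was 1.1.3.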

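Offline deployment

1. Preparation

Sync the following three files from the internet-connected host node1 to the master node of the offline environment.

  • kk

  • kubesphere.tar.gz

  • ks-core-1.1.3.tgz

I copied these three files to 172.16.21.30: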

scp kk hadoop@172.16.21.30:/work1/kubesphere

scp kubesphere.tar.gz hadoop@172.16.21.30:/work1/kubesphere

scp ks-core-1.1.3.tgz hadoop@172.16.21.30:/work1/kubesphere
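The offline package is large, so it is worth checking that the copies arrived intact. A minimal sketch, assuming GNU coreutils (sha256sum) on both hosts; the helper names and the SHA256SUMS file name are my own.

```shell
#!/bin/sh
# Hypothetical helpers for checking that files survived the copy intact.
# write_sums runs on the source host, check_sums on the destination host.
write_sums() {   # write_sums <dir> <file>...
    dir=$1; shift
    ( cd "$dir" && sha256sum "$@" > SHA256SUMS )
}
check_sums() {   # check_sums <dir>
    ( cd "$1" && sha256sum -c --quiet SHA256SUMS )
}

# Example (not executed here):
# write_sums /home/kubesphere kk kubesphere.tar.gz ks-core-1.1.3.tgz    # on node1
# scp /home/kubesphere/SHA256SUMS hadoop@172.16.21.30:/work1/kubesphere
# check_sums /work1/kubesphere                                           # on master
```

check_sums stays silent and returns 0 when every file matches, and names the file that differs otherwise.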

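2. Create the configuration file

  1. Create the configuration file for the offline cluster.

    On the offline master node, in hadoop@172.16.21.30:/work1/kubesphere,
    run the following command:
    
    ./kk create config --with-kubernetes v1.26.12

  2. Edit the configuration file.

    vi config-sample.yaml
    Notes
    • Modify the node information to match the actual offline environment.

    • Specify the node on which the registry is deployed; KubeKey uses it to deploy a self-hosted Harbor registry.

    • Under registry you can set type to harbor; otherwise a docker registry is installed by default.

    • For Kubernetes v1.24+, setting containerManager to containerd is recommended.

My modified file looks like this: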


apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: sample
spec:
  hosts:
  - {name: master, address: 172.16.21.35, internalAddress: 172.16.21.35, user: docker, password: "*******"}
  roleGroups:
    etcd:
    - master
    control-plane: 
    - master
    worker:
    - master
    # Harbor needs to be installed here
    registry:
    - master
  controlPlaneEndpoint:
    ## Internal loadbalancer for apiservers 
    # internalLoadbalancer: haproxy

    domain: lb.kubesphere.local
    address: ""
    port: 6443
  kubernetes:
    version: v1.26.12
    clusterName: cluster.local
    autoRenewCerts: true
    containerManager: containerd
  etcd:
    type: kubekey
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
    ## multus support. https://github.com/k8snetworkplumbingwg/multus-cni
    multusCNI:
      enabled: false
  registry:
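    # To deploy harbor with kk, set this parameter to harbor
    type: harbor
    auths:
      "dockerhub.kubekey.local":
        # The harbor username and password must be specified when deploying harbor
        username: admin
        password: Harbor12345
        skipTLSVerify: true
    # Private registry address used when deploying the cluster.
    privateRegistry: "dockerhub.kubekey.local"
    # The Kubernetes images in the offline package come from the Alibaba Cloud registry mirror, so this parameter must be set.
    namespaceOverride: "kubesphereio"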
    namespaceOverride: "kubesphereio"
    registryMirrors: []
    insecureRegistries: []
  addons: []


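3. Create the image registry

Run the following commands to create the image registry.

cd /work1/kubesphere
./kk init registry -f config-sample.yaml -a kubesphere.tar.gz


It failed. The machine is offline, so of course the download cannot work: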


downloading amd64 harbor v2.10.1  ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: github.com
14:27:10 CST [WARN] Having a problem with accessing https://storage.googleapis.com? You can try again after setting environment 'export KKZONE=cn'
14:27:10 CST message: [LocalHost]
Failed to download harbor binary: curl -L -o /work1/kubesphere/kubekey/registry/harbor/v2.10.1/amd64/harbor-offline-installer-v2.10.1.tgz https://github.com/goharbor/harbor/releases/download/v2.10.1/harbor-offline-installer-v2.10.1.tgz error: exit status 6 
14:27:10 CST failed: [LocalHost]
error: Pipeline[InitRegistryPipeline] execute failed: Module[RegistryPackageModule] exec failed: 
failed: [LocalHost] [DownloadRegistryPackage] exec failed after 1 retries: Failed to download harbor binary: curl -L -o /work1/kubesphere/kubekey/registry/harbor/v2.10.1/amd64/harbor-offline-installer-v2.10.1.tgz https://github.com/goharbor/harbor/releases/download/v2.10.1/harbor-offline-installer-v2.10.1.tgz error: exit status 6 

So I downloaded the file on the internet-connected machine and copied it over:

curl -L -o /home/kubesphere/kubekey/registry/harbor/v2.10.1/amd64/harbor-offline-installer-v2.10.1.tgz https://github.com/goharbor/harbor/releases/download/v2.10.1/harbor-offline-installer-v2.10.1.tgz

scp /home/kubesphere/kubekey/registry/harbor/v2.10.1/amd64/harbor-offline-installer-v2.10.1.tgz hadoop@172.16.21.30:/work1/kubesphere/kubekey/registry/harbor/v2.10.1/amd64/



Failed to download compose binary: curl -L -o /work1/kubesphere/kubekey/registry/compose/v2.26.1/amd64/docker-compose-linux-x86_64 https://github.com/docker/compose/releases/download/v2.26.1/docker-compose-linux-x86_64 error: exit status 6 
15:23:00 CST failed: [LocalHost]
error: Pipeline[InitRegistryPipeline] execute failed: Module[RegistryPackageModule] exec failed: 
failed: [LocalHost] [DownloadRegistryPackage] exec failed after 1 retries: Failed to download compose binary: curl -L -o /work1/kubesphere/kubekey/registry/compose/v2.26.1/amd64/docker-compose-linux-x86_64 https://github.com/docker/compose/releases/download/v2.26.1/docker-compose-linux-x86_64 error: exit status 6 




The fix is the same: download the file manually and copy it over.
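Both failures name the exact URL and the exact target path, so the two downloads can be prepared in one go. The sketch below only prints the pairs. The layout is copied from the error messages above, the versions are the ones from my run and may differ for other kk releases, and registry_downloads is a name I made up.

```shell
#!/bin/sh
# Hypothetical helper: print "URL<TAB>target path" for the two files that
# kk init registry tried to download, as reported in the error messages.
registry_downloads() {   # registry_downloads <kk working directory>
    base=$1/kubekey/registry
    printf '%s\t%s\n' \
        "https://github.com/goharbor/harbor/releases/download/v2.10.1/harbor-offline-installer-v2.10.1.tgz" \
        "$base/harbor/v2.10.1/amd64/harbor-offline-installer-v2.10.1.tgz" \
        "https://github.com/docker/compose/releases/download/v2.26.1/docker-compose-linux-x86_64" \
        "$base/compose/v2.26.1/amd64/docker-compose-linux-x86_64"
}

# On the connected host, each pair maps to: curl -L -o <path> <URL>
registry_downloads /work1/kubesphere
```

The target directories have to exist on the offline master before the files are copied with scp.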


The next error:
16:01:58 CST failed: [master]
error: Pipeline[CreateClusterPipeline] execute failed: Module[GreetingsModule] exec failed: 
failed: [master] execute task timeout, Timeout=30s


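So I added a timeout to the host entry: - {name: master, address: 172.16.21.35, internalAddress: 172.16.21.35, user: docker, password: "*******", timeout: 1200}

Adding the timeout did not help either. Searching suggested that the cause is an Ubuntu system running with a Chinese locale.

## Edit /etc/default/locale (vi /etc/default/locale) to change the language, then reboot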
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"

Downloading on the internet-connected machine was very slow too, so I went through a proxy just for this download.

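  • config-sample.yaml is the configuration file of the offline cluster.

  • kubesphere.tar.gz is the offline installation package containing the images of ks-core and the extensions.

If the following is displayed, the image registry was created successfully.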

Local image registry created successfully. Address: dockerhub.kubekey.local

15:59:09 CST success: [master]
15:59:09 CST [ChownWorkerModule] Chown ./kubekey dir
15:59:09 CST success: [LocalHost]
15:59:09 CST Pipeline[InitRegistryPipeline] execute successfully

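It looked like the installation had succeeded, but I could not actually reach the registry.

Port 80 was unreachable by IP and by domain name alike. telnet 127.0.0.1 80 connected, but the server's external IP did not, so the traffic was probably being blocked.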
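To compare loopback and the external address without telnet, a small TCP probe is enough. This is a sketch that relies on bash's /dev/tcp and the coreutils timeout command; port_open is a name I made up.

```shell
#!/bin/sh
# port_open is a hypothetical helper: succeed if a TCP connection to
# host:port can be opened within 3 seconds.
port_open() {   # port_open <host> <port>
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (not executed here): compare loopback with the address other machines use.
# for host in 127.0.0.1 172.16.21.35; do
#     if port_open "$host" 80; then echo "$host:80 reachable"; else echo "$host:80 NOT reachable"; fi
# done
```

If loopback answers and the external address does not, the usual suspects are a listener bound only to 127.0.0.1 (sudo ss -ltn shows the bound address) or a firewall rule in between.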

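4. Create the Harbor projects (if the image registry is Harbor)

Notes

Harbor projects are subject to role-based access control (RBAC): only users with the appropriate role can perform certain operations. If you have not created the project, images cannot be pushed to Harbor. Harbor has two types of projects:

  • Public: any user can pull images from the project.

  • Private: only users who are members of the project can pull images.

Harbor administrator account: admin, password: Harbor12345

The Harbor installation files are in /opt/harbor, and Harbor can be operated and maintained from that directory.

Run the following commands to create the Harbor projects. One thing to note: https://dockerhub.kubekey.local is a custom domain name. Installing Harbor writes several records into /etc/hosts, like this: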

# kubekey hosts BEGIN
172.16.21.35  master.cluster.local master
172.16.21.35  dockerhub.kubekey.local
172.16.21.35  lb.kubesphere.local
# kubekey hosts END

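 Create the script file.

Open create_project_harbor.sh with vi, write the following content, and run it. I hit all sorts of problems writing the file, and it was easier to paste the lines into the console and run them there.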

#!/usr/bin/env bash

# Copyright 2018 The KubeSphere Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

url="https://dockerhub.kubekey.local"  # or change to the actual image registry address
user="admin"
passwd="Harbor12345"

harbor_projects=(
        ks
        kubesphere
        kubesphereio
        coredns
        calico
        flannel
        cilium
        hybridnetdev
        kubeovn
        openebs
        library
        plndr
        jenkins
        argoproj
        dexidp
        openpolicyagent
        curlimages
        grafana
        kubeedge
        nginxinc
        prom
        kiwigrid
        minio
        opensearchproject
        istio
        jaegertracing
        timberio
        prometheus-operator
        jimmidyson
        elastic
        thanosio
        brancz
        prometheus
)

for project in "${harbor_projects[@]}"; do
    echo "creating $project"
    curl -u "${user}:${passwd}" -X POST -H "Content-Type: application/json" "${url}/api/v2.0/projects" -d "{ \"project_name\": \"${project}\", \"public\": true}" -k  # note the -k at the end of the curl command
done

Once they are created, the projects are visible on the Harbor web page.

5. Install Kubernetes

Run the following command to create the Kubernetes cluster:

./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-local-storage

It failed:

etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://172.16.21.35:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.16.21.35:2379: connect: connection refused

error #0: dial tcp 172.16.21.35:2379: connect: connection refused: Process exited with status 1
11:46:09 CST retry: [master]
11:46:14 CST message: [master]
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://172.16.21.35:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.16.21.35:2379: connect: connection refused

error #0: dial tcp 172.16.21.35:2379: connect: connection refused: Process exited with status 1
11:46:14 CST retry: [master]
11:46:19 CST message: [master]
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://172.16.21.35:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.16.21.35:2379: connect: connection refused

error #0: dial tcp 172.16.21.35:2379: connect: connection refused: Process exited with status 1
11:46:19 CST retry: [master]
11:46:25 CST message: [master]
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://172.16.21.35:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.16.21.35:2379: connect: connection refused

error #0: dial tcp 172.16.21.35:2379: connect: connection refused: Process exited with status 1
11:46:25 CST retry: [master]

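Opening the ports as in the steps further below did not help either.
Delete the cluster and reinstall:
./kk delete cluster -f config-sample.yaml
./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-local-storage


It still failed. I found that installing k8s wipes my existing iptables configuration. So I installed Harbor on a separate machine, then installed k8s following the steps above while skipping the Harbor installation.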

Open ports

# Allow SSH
sudo iptables -A INPUT -p tcp --dport 22 -j ACCEPT

# Allow etcd
sudo iptables -A INPUT -p tcp --dport 2379 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 2380 -j ACCEPT

# Allow the apiserver
sudo iptables -A INPUT -p tcp --dport 6443 -j ACCEPT

# Allow calico
sudo iptables -A INPUT -p tcp --dport 9099 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 9100 -j ACCEPT

# Allow BGP
sudo iptables -A INPUT -p tcp --dport 179 -j ACCEPT

# Allow NodePort services
sudo iptables -A INPUT -p tcp --dport 30000:32767 -j ACCEPT

# Allow master services
sudo iptables -A INPUT -p tcp --dport 10250 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 10258 -j ACCEPT

# Allow DNS (TCP)
sudo iptables -A INPUT -p tcp --dport 53 -j ACCEPT

# Allow DNS (UDP)
sudo iptables -A INPUT -p udp --dport 53 -j ACCEPT

# Allow metrics-server
sudo iptables -A INPUT -p tcp --dport 8443 -j ACCEPT

# Allow local-registry
sudo iptables -A INPUT -p tcp --dport 5000 -j ACCEPT

# Allow local-apt
sudo iptables -A INPUT -p tcp --dport 5080 -j ACCEPT

# Allow rpcbind
sudo iptables -A INPUT -p tcp --dport 111 -j ACCEPT


Check that the rules took effect:
sudo iptables -L -n -v
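The rule list is repetitive, so it can also be generated from a list of ports. The sketch below only prints the commands so they can be reviewed first; emit_rules is a name I made up, and the ports are the same ones as in the commands above.

```shell
#!/bin/sh
# emit_rules is a hypothetical helper: print (not run) one ACCEPT rule per
# port or port range, so the list can be reviewed before it is applied.
emit_rules() {   # emit_rules <proto> <port>...
    proto=$1; shift
    for port in "$@"; do
        printf 'iptables -A INPUT -p %s --dport %s -j ACCEPT\n' "$proto" "$port"
    done
}

emit_rules tcp 22 2379 2380 6443 9099 9100 179 30000:32767 10250 10258 53 8443 5000 5080 111
emit_rules udp 53
```

Once the printed list looks right, it can be piped to sudo sh to apply it.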



By default, iptables rules are lost after a reboot. To make them permanent, they have to be saved.
sudo apt update
sudo apt install iptables-persistent

Save the rules:

sudo netfilter-persistent save

Related error records

1. ethtool

[WARNING FileExisting-ethtool]: ethtool not found in system path
error execution phase preflight: [preflight] Some fatal errors occurred:


Fix on Ubuntu 22.04:

sudo apt-get update
sudo apt-get install ethtool

2.  container runtime is not running

 [ERROR CRI]: container runtime is not running: output: time="2024-12-10T13:46:06+08:00" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"


Fix:

Check the service status:

systemctl status containerd

The status is normal: active (running)

sudo vi /etc/containerd/config.toml 

Comment out disabled_plugins = ["cri"]

Then restart containerd:

sudo systemctl restart containerd


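Root cause analysis and references
Some background on versions
According to the Kubernetes website, dockershim (the Docker integration built into the kubelet) was removed in Kubernetes v1.24.0, so the kubelet needs a CRI-compatible container runtime such as containerd. containerd therefore has to be installed.
containerd started out as part of Docker and is now a separate project, so it can be installed on its own.

About containerd
Common container runtimes include docker, containerd, and CRI-O.
containerd implements the CRI (Container Runtime Interface) through its cri plugin; the kubelet calls containerd through this interface to create, run, and destroy containers.
These runtimes follow the OCI specification and use runc to interact with the operating system kernel, which is what actually creates and runs the containers.
docker uses containerd as its runtime; k8s uses containerd, CRI-O, and so on.

Breaking down the error message
CRI: Container Runtime Interface
container runtime is not running: the container runtime is not usable
validate service connection: the step that checks the connection to the runtime service
validate CRI v1 runtime API for endpoint "unix:///run/containerd/containerd.sock" ... unknown service runtime.v1.RuntimeService: the endpoint answered, but it does not serve the CRI v1 runtime API. The socket itself is there (containerd is running); the cri plugin behind it is disabled.

containerd disables CRI by default (the key point)
containerd installed from a package has its cri plugin disabled by default, which means it cannot act as the container runtime for k8s right after installation.
k8s then reports an error, because it has no usable container runtime.
To enable it, make sure the disabled_plugins list in /etc/containerd/config.toml does not contain cri.
Restart containerd after the change for it to take effect.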
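A quick way to see whether a node is affected is to look for cri in an uncommented disabled_plugins line. This is a sketch; cri_disabled is a name I made up, and it only does a text match rather than parsing the TOML.

```shell
#!/bin/sh
# cri_disabled is a hypothetical helper: succeed if an uncommented
# disabled_plugins line in the given containerd config mentions "cri".
cri_disabled() {   # cri_disabled <path to config.toml>
    grep -Eq '^[[:space:]]*disabled_plugins[[:space:]]*=.*"cri"' "$1"
}

# Example (not executed here):
# if cri_disabled /etc/containerd/config.toml; then
#     echo "cri plugin is disabled: edit the file, then: sudo systemctl restart containerd"
# fi
```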

3. Install ipvsadm: sudo apt install ipvsadm

4. Install chrony: sudo apt install chrony

After all of the steps above, etcd still failed to start, with the same error:

14:25:27 CST message: [master]
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://172.16.21.35:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.16.21.35:2379: connect: connection refused



Check the status:

 sudo systemctl status etcd

● etcd.service - etcd
     Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Tue 2024-12-10 14:55:26 CST; 2s ago
    Process: 6813 ExecStart=/usr/local/bin/etcd (code=exited, status=1/FAILURE)
   Main PID: 6813 (code=exited, status=1/FAILURE)

It is in a failed state.

Start it again:  sudo systemctl start etcd


Look at the error messages:

journalctl -xe


Dec 10 15:07:03 master etcd[8160]: {"level":"fatal","ts":"2024-12-10T15:07:03.963267+0800","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:267"}


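I noticed a problem:

telnet 172.16.21.35 2379 fails to connect
telnet 127.0.0.1 2379 connects


Checking the configuration file with cat /etc/etcd.env

revealed the problem, and I changed the settings to proper values: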
ETCDCTL_ENDPOINTS=https://172.16.21.35:2379  
ETCD_LISTEN_CLIENT_URLS=https://172.16.21.35:2379
ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379
ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380
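The listing above assigns ETCD_LISTEN_CLIENT_URLS twice, which makes it hard to tell which value etcd ends up with. I assume the later line wins if the file is loaded as a systemd EnvironmentFile, but I have not verified that. A sketch for spotting such duplicates, with a helper name of my own and assuming plain KEY=VALUE lines:

```shell
#!/bin/sh
# dup_keys is a hypothetical helper: print every variable name that is
# assigned more than once in a simple KEY=VALUE file. Comment lines are skipped.
dup_keys() {   # dup_keys <env file>
    awk -F= '/^[[:space:]]*#/ || !/=/ { next }
             { count[$1]++ }
             END { for (k in count) if (count[k] > 1) print k }' "$1" | sort
}

# Example (not executed here):
# dup_keys /etc/etcd.env
```

Removing the duplicate and keeping a single, intended value takes the guesswork out of the next restart.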


After starting, etcd was still bound to a 127.x address. Checking /etc/hosts, I found this entry:

127.0.1.1      master

I changed it to:

172.16.21.35 master
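To check in advance which addresses a hosts file maps to the node name, something like the following can be used. It is a sketch with a helper name of my own (host_addrs), and it reads only the file passed to it, not DNS.

```shell
#!/bin/sh
# host_addrs is a hypothetical helper: print every address that a hosts file
# maps to the given name, ignoring comments.
host_addrs() {   # host_addrs <hosts file> <name>
    awk -v name="$2" '
        { sub(/#.*/, "") }                                      # strip comments
        { for (i = 2; i <= NF; i++) if ($i == name) print $1 }  # match any alias
    ' "$1"
}

# Example (not executed here): a 127.x address in the output means the name
# also resolves to loopback.
# host_addrs /etc/hosts master
```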


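What led me to this: running sudo /usr/local/bin/etcd from the command line starts fine, but starting it through the service fails, because the environment variables configured for the service use the IP 172.16.21.35.

Start it again:  sudo systemctl start etcd
 
 sudo systemctl status etcd

etcd still would not start. Check the error log by running, as root, journalctl -u etcd -xe > 1.log

The most recent useful log entries: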

Dec 10 16:01:22 master etcd[18172]: {"level":"info","ts":"2024-12-10T16:01:22.959584+0800","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.13","git-sha":"c9063a0dc","go-version":"go1.21.8","go-os":"linux","go-arch":"amd64","max-cpu-set":48,"max-cpu-available":48,"member-initialized":false,"name":"etcd-master","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"250ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":10000,"max-wals":5,"max-snapshots":5,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://172.16.21.35:2380"],"listen-peer-urls":["https://172.16.21.35:2380"],"advertise-client-urls":["https://172.16.21.35:2379"],"listen-client-urls":["https://172.16.21.35:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"etcd-master=https://172.16.21.35:2380","initial-cluster-state":"existing","initial-cluster-token":"k8s_etcd","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"8h0m0s","auto-compaction-interval":"8h0m0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
Dec 10 16:01:22 master etcd[18172]: {"level":"warn","ts":"2024-12-10T16:01:22.959654+0800","caller":"fileutil/fileutil.go:53","msg":"check file permission","error":"directory \"/var/lib/etcd\" exist, but the permission is \"drwxr-xr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
Dec 10 16:01:22 master etcd[18172]: {"level":"info","ts":"2024-12-10T16:01:22.960848+0800","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"1.016372ms"}
Dec 10 16:01:22 master etcd[18172]: {"level":"info","ts":"2024-12-10T16:01:22.961965+0800","caller":"embed/etcd.go:375","msg":"closing etcd server","name":"etcd-master","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://172.16.21.35:2380"],"advertise-client-urls":["https://172.16.21.35:2379"]}
Dec 10 16:01:22 master etcd[18172]: {"level":"info","ts":"2024-12-10T16:01:22.962028+0800","caller":"embed/etcd.go:377","msg":"closed etcd server","name":"etcd-master","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://172.16.21.35:2380"],"advertise-client-urls":["https://172.16.21.35:2379"]}
Dec 10 16:01:22 master etcd[18172]: {"level":"fatal","ts":"2024-12-10T16:01:22.962054+0800","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:267"}
Dec 10 16:01:22 master systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited

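The log contains a warning that the permissions of /var/lib/etcd do not match the recommendation. The recommended mode is 700 (drwx------), while the directory is currently 755 (drwxr-xr-x), which could let unauthorized users read the data.

Fix
Change the directory permissions of /var/lib/etcd to the recommended mode:


chmod 700 /var/lib/etcd

Check the directory ownership and make sure its owner and group are the etcd user:


chown -R etcd:etcd /var/lib/etcd

Restart the etcd service once the changes are done:


systemctl start etcd

Switch to starting a new single-node cluster. The startup log above shows "initial-cluster-state":"existing" together with "member-initialized":false, so etcd was trying to join an existing cluster that is not there, which is consistent with the fatal "discovery failed" error.

ETCD_INITIAL_CLUSTER_STATE=new


Delete the old data:
rm -rf /var/lib/etcd/*

Start it: sudo systemctl start etcd


It started successfully.

Then run the installation again:

./kk delete cluster -f config-sample.yaml
./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-local-storage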

It reports a certificate problem:

 [WARNING ImagePull]: failed to pull image dockerhub.kubekey.local/kubesphereio/coredns:1.9.3: output: E1210 16:48:21.678224   36794 remote_image.go:180] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"dockerhub.kubekey.local/kubesphereio/coredns:1.9.3\": failed to resolve reference \"dockerhub.kubekey.local/kubesphereio/coredns:1.9.3\": failed to do request: Head \"https://dockerhub.kubekey.local/v2/kubesphereio/coredns/manifests/1.9.3\": tls: failed to verify certificate: x509: certificate signed by unknown authority" image="dockerhub.kubekey.local/kubesphereio/coredns:1.9.3"
time="2024-12-10T16:48:21+08:00" level=fatal msg="pulling image: failed to pull and unpack image \"dockerhub.kubekey.local/kubesphereio/coredns:1.9.3\": failed to resolve reference \"dockerhub.kubekey.local/kubesphereio/coredns:1.9.3\": failed to do request: Head \"https://dockerhub.kubekey.local/v2/kubesphereio/coredns/manifests/1.9.3\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
, error: exit status 1

This is only a warning and can be ignored.

sudo mkdir -p /etc/docker/certs.d/dockerhub.kubekey.local
I found that the directory /etc/docker/certs.d/dockerhub.kubekey.local already contains the Harbor certificate.

The next error:

error: Pipeline[CreateClusterPipeline] execute failed: Module[KubernetesStatusModule] exec failed: 
failed: [master] [GetClusterStatus] exec failed after 3 retries: get kubernetes cluster info failed: Failed to exec command: sudo -E /bin/bash -c "/usr/local/bin/kubectl --no-headers=true get nodes -o custom-columns=:metadata.name,:status.nodeInfo.kubeletVersion,:status.addresses" 
E1211 10:19:34.703393   25975 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 10:19:34.704539   25975 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 10:19:34.705431   25975 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 10:19:34.707890   25975 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 10:19:34.708539   25975 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused

Switch to the root account.

To find the main problem, check the error log with journalctl -xeu kubelet:

Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289043   48502 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289224   48502 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-scheduler-master"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289289   48502 kuberuntime_manager.go:782] "CreatePodSandbox for pod failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-scheduler-master"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289468   48502 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-scheduler-master_kube-system(4ca7fb2db07d0f724baa8308d590dcb6)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-scheduler-master_kube-system(4ca7fb2db07d0f724baa8308d590dcb6)\\\": rpc error: code = DeadlineExceeded desc = failed to get sandbox image \\\"registry.k8s.io/pause:3.8\\\": failed to pull image \\\"registry.k8s.io/pause:3.8\\\": failed to pull and unpack image \\\"registry.k8s.io/pause:3.8\\\": failed to resolve reference \\\"registry.k8s.io/pause:3.8\\\": failed to do request: Head \\\"https://registry.k8s.io/v2/pause/manifests/3.8\\\": dial tcp 34.96.108.209:443: i/o timeout\"" pod="kube-system/kube-scheduler-master" podUID=4ca7fb2db07d0f724baa8308d590dcb6
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289619   48502 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289697   48502 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-controller-manager-master"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289740   48502 kuberuntime_manager.go:782] "CreatePodSandbox for pod failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-controller-manager-master"
Dec 11 10:26:58 master kubelet[48502]: E1211 10:26:58.289833   48502 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-controller-manager-master_kube-system(f4e475d9dffaba24cb459a418e20d79b)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-controller-manager-master_kube-system(f4e475d9dffaba24cb459a418e20d79b)\\\": rpc error: code = DeadlineExceeded desc = failed to get sandbox image \\\"registry.k8s.io/pause:3.8\\\": failed to pull image \\\"registry.k8s.io/pause:3.8\\\": failed to pull and unpack image \\\"registry.k8s.io/pause:3.8\\\": failed to resolve reference \\\"registry.k8s.io/pause:3.8\\\": failed to do request: Head \\\"https://registry.k8s.io/v2/pause/manifests/3.8\\\": dial tcp 34.96.108.209:443: i/o timeout\"" pod="kube-system/kube-controller-manager-master" podUID=f4e475d9dffaba24cb459a418e20d79b

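The main cause of this error is that the image registry.k8s.io/pause:3.8 cannot be pulled, which shows that the offline deployment documentation is incomplete.

Download it manually: docker pull kubesphere/pause:3.8

Then tag it: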

docker tag kubesphere/pause:3.8 registry.k8s.io/pause:3.8
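One thing worth checking at this point: with containerManager set to containerd, it is containerd that fetches the sandbox image for the kubelet, and as far as I know an image tagged in Docker's own store is not visible in containerd's k8s.io namespace. The sketch below reads the configured sandbox image from a containerd 1.x style config; the sandbox_image key under the cri plugin and the helper name are assumptions to verify on your node.

```shell
#!/bin/sh
# sandbox_image is a hypothetical helper: print the sandbox_image value from a
# containerd config file (empty output if the key is not set).
sandbox_image() {   # sandbox_image <path to config.toml>
    sed -n 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"\(.*\)".*/\1/p' "$1"
}

# Example (not executed here):
# sandbox_image /etc/containerd/config.toml
# If the image has to be loaded by hand, containerd's own store is the target:
# docker save registry.k8s.io/pause:3.8 -o pause.tar
# sudo ctr -n k8s.io images import pause.tar
```

Pointing sandbox_image at a copy of the pause image in the private registry would be another option; I have not tried either in this record.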

Run the installation:

./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-local-storage

It still fails:

get kubernetes cluster info failed: Failed to exec command: sudo -E /bin/bash -c "/usr/local/bin/kubectl --no-headers=true get nodes -o custom-columns=:metadata.name,:status.nodeInfo.kubeletVersion,:status.addresses" 
E1211 11:10:34.727469   27176 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 11:10:34.728832   27176 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 11:10:34.729168   27176 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 11:10:34.730709   27176 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1211 11:10:34.731018   27176 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?: Process exited with status 1
11:10:34 CST failed: [master]

Check the kubelet log: sudo journalctl -xeu kubelet

Dec 11 11:14:45 master kubelet[48502]: E1211 11:14:45.286340   48502 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout"
Dec 11 11:14:45 master kubelet[48502]: E1211 11:14:45.286463   48502 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-scheduler-master"
Dec 11 11:14:45 master kubelet[48502]: E1211 11:14:45.286493   48502 kuberuntime_manager.go:782] "CreatePodSandbox for pod failed" err="rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://registry.k8s.io/v2/pause/manifests/3.8\": dial tcp 34.96.108.209:443: i/o timeout" pod="kube-system/kube-scheduler-master"
Dec 11 11:14:45 master kubelet[48502]: E1211 11:14:45.286613   48502 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-scheduler-master_kube-system(4ca7fb2db07d0f724baa8308d590dcb6)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-scheduler-master_kube-system(4ca7fb2db07d0f724baa8308d590dcb6)\\\": rpc error: code = DeadlineExceeded desc = failed to get sandbox image \\\"registry.k8s.io/pause:3.8\\\": failed to pull image \\\"registry.k8s.io/pause:3.8\\\": failed to pull and unpack image \\\"registry.k8s.io/pause:3.8\\\": failed to resolve reference \\\"registry.k8s.io/pause:3.8\\\": failed to do request: Head \\\"https://registry.k8s.io/v2/pause/manifests/3.8\\\": dial tcp 34.96.108.209:443: i/o timeout\"" pod="kube-system/kube-scheduler-master" podUID=4ca7fb2db07d0f724baa8308d590dcb6
Dec 11 11:14:45 master kub
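The journal output above repeats the same failure for every control-plane pod. As a reading aid, the sketch below reduces such output to the distinct image references that failed to pull. The function name failed_images is made up for this example, and it only assumes the "failed to pull image" wording seen in the log lines above.

```shell
# Hypothetical helper: print each distinct image named in a
# "failed to pull image" message read from standard input.
failed_images() {
  # Match the phrase, any run of escaping backslashes, the opening quote,
  # and then the image reference up to the next backslash or quote.
  grep -o 'failed to pull image [\\]*"[^"\\]*' \
    | sed 's/.*"//' \
    | sort -u
}

# Typical use on the node (needs permission to read the journal):
#   sudo journalctl -u kubelet --no-pager | failed_images
```

If only one image is printed, as would be expected for the log above, fixing that single reference is the next step rather than chasing each pod separately.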

Checking the kubelet status shows that it is running:

systemctl status kubelet

Running kubectl get pod --all-namespaces fails:

E1211 11:47:34.401922   27583 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
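kubectl falls back to http://localhost:8080 when it finds no kubeconfig, so this message alone does not prove that the API server is down. A minimal sketch for telling the two cases apart follows. The function name check_kubeconfig is invented for this example, and it treats KUBECONFIG as a single path, although kubectl also accepts a colon-separated list.

```shell
# Hypothetical helper: report which kubeconfig kubectl would pick up, if any.
check_kubeconfig() {
  cfg="${KUBECONFIG:-$HOME/.kube/config}"
  if [ -s "$cfg" ]; then
    echo "kubeconfig found: $cfg"
  else
    echo "no kubeconfig at $cfg; kubectl will fall back to localhost:8080"
  fi
}

check_kubeconfig

# On kubeadm-based clusters the admin kubeconfig is normally written here
# (an assumption for this cluster, since the install did not finish):
#   export KUBECONFIG=/etc/kubernetes/admin.conf
# To see whether anything is listening on the API server port:
#   ss -ltn | grep 6443
```

If a kubeconfig exists and port 6443 still refuses connections, the API server itself is not running, which matches the kubelet log in this case.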

After restarting, the console shows this message:

The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10]

Fix:

sudo vi /var/lib/kubelet/config.yaml 

Change the following:

clusterDNS:
- 169.254.25.10

to:

clusterDNS:
- 10.233.0.10
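The same edit can be scripted, which helps when it has to be repeated after every failed run. This is a sketch only: set_cluster_dns is a made-up helper name, and it assumes the file has a single clusterDNS entry on the line directly below the key, as in the snippet above.

```shell
# Hypothetical helper: replace the first list entry under "clusterDNS:"
# in the given kubelet config file with the given IP address.
# A backup copy is kept next to the file with a .bak suffix.
set_cluster_dns() {
  file="$1"
  ip="$2"
  sed -i.bak "/^clusterDNS:/{n;s/^- .*/- $ip/;}" "$file"
}

# On the node (requires root), followed by a kubelet restart:
#   set_cluster_dns /var/lib/kubelet/config.yaml 10.233.0.10
#   systemctl restart kubelet
```

The kubelet reads this file only at startup, so the change has no effect until the service is restarted.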




Fix for "command failed" err="failed to validate kubelet flags: the container runtime endpoint address was not specified or empty, use --container-runtime-endpoint to set": append the flag to ExecStart in the kubelet systemd drop-in.


sudo vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

ExecStart=/usr/local/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS --container-runtime-endpoint=unix:///run/containerd/containerd.sock
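systemd does not notice edits to a drop-in file until its configuration is reloaded, so the new flag only takes effect after a daemon-reload and a kubelet restart. The sketch below first checks that the flag really is in the file. The function name has_runtime_endpoint is invented for this example.

```shell
# Hypothetical helper: succeed if the given unit or drop-in file
# passes --container-runtime-endpoint to the kubelet.
has_runtime_endpoint() {
  grep -q -e '--container-runtime-endpoint=' "$1"
}

# On the node (requires root):
#   has_runtime_endpoint /etc/systemd/system/kubelet.service.d/10-kubeadm.conf \
#     && systemctl daemon-reload \
#     && systemctl restart kubelet
```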

What is certain at this point is that the Kubernetes API server has not started, and no logs can be found for it.

Check the YAML:

yamllint /etc/kubernetes/manifests/kube-apiserver.yaml

It reports errors.

I reformatted the file with the VS Code YAML extension, although I doubted that this would solve the problem.

Indeed, it made no difference. Following the log with journalctl -u kubelet -f turned up this error:

Dec 11 15:40:29 master kubelet[38099]: E1211 15:40:29.006151   38099 file.go:187] "Could not process manifest file" err="/etc/kubernetes/manifests/ystemctl status docker: couldn't parse as pod(yaml: control characters are not allowed), please check config file" path="/etc/kubernetes/manifests/ystemctl status docker"

ps aux | grep kubelet
root     38099  3.1  0.1 4755728 109976 ?      Ssl  14:33   2:12 /usr/local/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --container-runtime-endpoint=unix:///run/containerd/containerd.sock --pod-infra-container-image=dockerhub.kubekey.local/kubesphereio/pause:3.9 --node-ip=172.16.21.35 --hostname-override=master

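So there is a stray file in /etc/kubernetes/manifests (the "ystemctl status docker" named in the log above). I deleted it, but that did not help either.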
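The kubelet appears to try every file in its static pod directory as a manifest, which is how a mistyped command ended up in the log as a "manifest". A quick way to spot such leftovers is to list everything that does not look like a manifest. The helper name list_stray_manifests is made up for this sketch, and the extension filter is a heuristic, not the kubelet's own rule.

```shell
# Hypothetical helper: list regular files in a static pod directory
# whose names do not end in .yaml, .yml or .json.
list_stray_manifests() {
  dir="${1:-/etc/kubernetes/manifests}"
  find "$dir" -maxdepth 1 -type f \
    ! -name '*.yaml' ! -name '*.yml' ! -name '*.json'
}

# On the node:
#   list_stray_manifests /etc/kubernetes/manifests
```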

The errors that keep scrolling by in journalctl -u kubelet -f:

Dec 11 15:48:26 master kubelet[38099]: E1211 15:48:26.638739   38099 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-controller-manager-master_kube-system(f4e475d9dffaba24cb459a418e20d79b)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-controller-manager-master_kube-system(f4e475d9dffaba24cb459a418e20d79b)\\\": rpc error: code = Unknown desc = failed to get sandbox image \\\"registry.k8s.io/pause:3.8\\\": failed to pull image \\\"registry.k8s.io/pause:3.8\\\": failed to pull and unpack image \\\"registry.k8s.io/pause:3.8\\\": failed to resolve reference \\\"registry.k8s.io/pause:3.8\\\": failed to do request: Head \\\"https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.8\\\": dial tcp 74.125.199.82:443: i/o timeout\"" pod="kube-system/kube-controller-manager-master" podUID=f4e475d9dffaba24cb459a418e20d79b
Dec 11 15:48:29 master kubelet[38099]: I1211 15:48:29.073478   38099 status_manager.go:698] "Failed to get status for pod" podUID=9eb830c8cce30bfcab1dc46488c4c23e pod="kube-system/kube-apiserver-master" err="Get \"https://lb.kubesphere.local:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-master\": dial tcp 172.16.21.35:6443: connect: connection refused"
Dec 11 15:48:29 master kubelet[38099]: E1211 15:48:29.640909   38099 eviction_manager.go:261] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"master\" not found"
Dec 11 15:48:30 master kubelet[38099]: E1211 15:48:30.173752   38099 controller.go:146] failed to ensure lease exists, will retry in 7s, error: Get "https://lb.kubesphere.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master?timeout=10s": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:32 master kubelet[38099]: I1211 15:48:32.253793   38099 kubelet_node_status.go:70] "Attempting to register node" node="master"
Dec 11 15:48:32 master kubelet[38099]: E1211 15:48:32.255001   38099 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://lb.kubesphere.local:6443/api/v1/nodes\": dial tcp 172.16.21.35:6443: connect: connection refused" node="master"
Dec 11 15:48:33 master kubelet[38099]: E1211 15:48:33.140053   38099 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-controller-manager-master.18100bd8e22b2b4b", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"kube-controller-manager-master", UID:"f4e475d9dffaba24cb459a418e20d79b", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"FailedCreatePodSandBox", Message:"Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.8\": dial tcp 74.125.195.82:443: i/o timeout", Source:v1.EventSource{Component:"kubelet", Host:"master"}, FirstTimestamp:time.Date(2024, time.December, 11, 14, 34, 42, 672962379, time.Local), LastTimestamp:time.Date(2024, time.December, 11, 14, 34, 42, 672962379, time.Local), Count:1, Type:"Warning", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"kubelet", ReportingInstance:"master"}': 'Post "https://lb.kubesphere.local:6443/api/v1/namespaces/kube-system/events": dial tcp 172.16.21.35:6443: connect: connection refused'(may retry after sleeping)
Dec 11 15:48:37 master kubelet[38099]: E1211 15:48:37.175829   38099 controller.go:146] failed to ensure lease exists, will retry in 7s, error: Get "https://lb.kubesphere.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master?timeout=10s": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:39 master kubelet[38099]: I1211 15:48:39.073764   38099 status_manager.go:698] "Failed to get status for pod" podUID=9eb830c8cce30bfcab1dc46488c4c23e pod="kube-system/kube-apiserver-master" err="Get \"https://lb.kubesphere.local:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-master\": dial tcp 172.16.21.35:6443: connect: connection refused"
Dec 11 15:48:39 master kubelet[38099]: I1211 15:48:39.258619   38099 kubelet_node_status.go:70] "Attempting to register node" node="master"
Dec 11 15:48:39 master kubelet[38099]: E1211 15:48:39.259595   38099 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://lb.kubesphere.local:6443/api/v1/nodes\": dial tcp 172.16.21.35:6443: connect: connection refused" node="master"
Dec 11 15:48:39 master kubelet[38099]: E1211 15:48:39.641239   38099 eviction_manager.go:261] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"master\" not found"
Dec 11 15:48:40 master kubelet[38099]: E1211 15:48:40.183260   38099 certificate_manager.go:471] kubernetes.io/kube-apiserver-client-kubelet: Failed while requesting a signed certificate from the control plane: cannot create certificate signing request: Post "https://lb.kubesphere.local:6443/apis/certificates.k8s.io/v1/certificatesigningrequests": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:41 master kubelet[38099]: W1211 15:48:41.010588   38099 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Node: Get "https://lb.kubesphere.local:6443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster&limit=500&resourceVersion=0": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:41 master kubelet[38099]: E1211 15:48:41.010680   38099 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://lb.kubesphere.local:6443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster&limit=500&resourceVersion=0": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:42 master kubelet[38099]: E1211 15:48:42.430816   38099 file.go:108] "Unable to process watch event" err="the pod with key kube-system/kube-apiserver-master doesn't exist in cache"
Dec 11 15:48:43 master kubelet[38099]: E1211 15:48:43.141377   38099 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-controller-manager-master.18100bd8e22b2b4b", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"kube-controller-manager-master", UID:"f4e475d9dffaba24cb459a418e20d79b", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"FailedCreatePodSandBox", Message:"Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = failed to get sandbox image \"registry.k8s.io/pause:3.8\": failed to pull image \"registry.k8s.io/pause:3.8\": failed to pull and unpack image \"registry.k8s.io/pause:3.8\": failed to resolve reference \"registry.k8s.io/pause:3.8\": failed to do request: Head \"https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.8\": dial tcp 74.125.195.82:443: i/o timeout", Source:v1.EventSource{Component:"kubelet", Host:"master"}, FirstTimestamp:time.Date(2024, time.December, 11, 14, 34, 42, 672962379, time.Local), LastTimestamp:time.Date(2024, time.December, 11, 14, 34, 42, 672962379, time.Local), Count:1, Type:"Warning", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"kubelet", ReportingInstance:"master"}': 'Post "https://lb.kubesphere.local:6443/api/v1/namespaces/kube-system/events": dial tcp 172.16.21.35:6443: connect: connection refused'(may retry after sleeping)
Dec 11 15:48:44 master kubelet[38099]: E1211 15:48:44.178978   38099 controller.go:146] failed to ensure lease exists, will retry in 7s, error: Get "https://lb.kubesphere.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master?timeout=10s": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:46 master kubelet[38099]: I1211 15:48:46.263268   38099 kubelet_node_status.go:70] "Attempting to register node" node="master"
Dec 11 15:48:46 master kubelet[38099]: E1211 15:48:46.264191   38099 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://lb.kubesphere.local:6443/api/v1/nodes\": dial tcp 172.16.21.35:6443: connect: connection refused" node="master"
Dec 11 15:48:46 master kubelet[38099]: W1211 15:48:46.972132   38099 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.CSIDriver: Get "https://lb.kubesphere.local:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:46 master kubelet[38099]: E1211 15:48:46.972238   38099 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: Get "https://lb.kubesphere.local:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 172.16.21.35:6443: connect: connection refused
Dec 11 15:48:49 master kubelet[38099]: I1211 15:48:49.007214   38099 topology_manager.go:210] "Topology Admit Handler" podUID=9eb830c8cce30bfcab1dc46488c4c23e podNamespace="kube-system" podName="kube-apiserver-master"
Dec 11 15:48:49 master kubelet[38099]: E1211 15:48:49.642259   38099 eviction_manager.go:261] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"master\" not found"
Dec 11 15:48:51 master kubelet[38099]: E1211 15:48:51.180776   38099 controller.go:146] failed to ensure lease exists, will retry in 7s, error: Get "https://lb.kubesphere.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master?timeout=10s": dial tcp 172.16.21.35:6443: connect: connection refused

First, fix the failure to pull the pause image.

containerd config default > /etc/containerd/config.toml


sudo vi /etc/containerd/config.toml

[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
    endpoint = ["https://xxxxxxxxx.mirror.swr.myhuaweicloud.com"]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
    endpoint = ["https://xxxxxxxxx.mirror.swr.myhuaweicloud.com"]

Replace the endpoint URL here with your own mirror address. Reference tutorial: https://support.huaweicloud.com/usermanual-swr/swr_01_0045.html


systemctl restart containerd


This had no effect.
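A registry mirror only changes where a reference is fetched from. The reference itself, registry.k8s.io/pause:3.8, comes from the sandbox_image setting in containerd's CRI plugin section, and regenerating the file with containerd config default resets that setting to the upstream default. One possible fix, not tried in this write-up, is to point sandbox_image at the pause image in the offline registry. In the sketch below, set_sandbox_image is a made-up helper name, and the image reference is taken from the kubelet command line shown earlier. Whether that image actually exists in the registry is an assumption.

```shell
# Hypothetical helper: rewrite the sandbox_image line of a containerd
# config file, keeping its indentation. A .bak backup is left next to it.
set_sandbox_image() {
  file="$1"
  image="$2"
  sed -i.bak "s|^\( *sandbox_image = \).*|\1\"$image\"|" "$file"
}

# On the node (requires root), followed by a containerd restart:
#   set_sandbox_image /etc/containerd/config.toml \
#     dockerhub.kubekey.local/kubesphereio/pause:3.9
#   systemctl restart containerd
```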

Import the image from Docker into containerd:

docker save registry.k8s.io/pause:3.8 -o pause.tar    # save the image to a tar file

ctr -n k8s.io images import pause.tar    # import it into containerd's k8s.io namespace

Run the installation:

./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-local-storage

A new error appears:

Dec 11 16:48:36 master kubelet[43735]: E1211 16:48:36.464582   43735 remote_image.go:171] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12\": failed to resolve reference \"dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12\": failed to do request: Head \"https://dockerhub.kubekey.local/v2/kubesphereio/kube-scheduler/manifests/v1.26.12\": tls: failed to verify certificate: x509: certificate signed by unknown authority" image="dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12"

Dec 11 16:48:36 master kubelet[43735]: E1211 16:48:36.464615   43735 kuberuntime_image.go:53] "Failed to pull image" err="rpc error: code = Unknown desc = failed to pull and unpack image \"dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12\": failed to resolve reference \"dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12\": failed to do request: Head \"https://dockerhub.kubekey.local/v2/kubesphereio/kube-scheduler/manifests/v1.26.12\": tls: failed to verify certificate: x509: certificate signed by unknown authority" image="dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12"
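The x509 message means the container runtime reached dockerhub.kubekey.local but does not trust the certificate authority that signed its certificate. containerd can be given a per-registry CA through a hosts.toml file. The sketch below writes such a file. The helper name write_hosts_toml and the CA path are placeholders, and it assumes that config_path under the CRI registry section of /etc/containerd/config.toml points at the same directory and that inline registry mirrors are not configured at the same time.

```shell
# Hypothetical helper: write <certs_dir>/<host>/hosts.toml so that
# containerd trusts the given CA file when talking to that registry host.
write_hosts_toml() {
  certs_dir="$1"
  host="$2"
  ca="$3"
  mkdir -p "$certs_dir/$host"
  cat > "$certs_dir/$host/hosts.toml" <<EOF
server = "https://$host"

[host."https://$host"]
  capabilities = ["pull", "resolve"]
  ca = "$ca"
EOF
}

# On the node (requires root); /path/to/ca.crt is a placeholder:
#   write_hosts_toml /etc/containerd/certs.d dockerhub.kubekey.local /path/to/ca.crt
#   systemctl restart containerd
```

Since the request got as far as TLS verification, the registry host is reachable over the network, and the problem is trust rather than connectivity.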


Pull the images manually:

docker pull dockerhub.kubekey.local/kubesphereio/kube-apiserver:v1.26.12
docker pull dockerhub.kubekey.local/kubesphereio/kube-scheduler:v1.26.12


The error then changes to:

Dec 11 16:55:45 master kubelet[43735]: E1211 16:55:45.453268   43735 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with ImagePullBackOff: \"Back-off pulling image \\\"dockerhub.kubekey.local/kubesphereio/kube-apiserver:v1.26.12\\\"\"" pod="kube-system/kube-apiserver-master" podUID=9eb830c8cce30bfcab1dc46488c4c23e
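ImagePullBackOff right after a successful docker pull suggests that the kubelet is not looking at Docker's image store at all. The ps aux output earlier shows the kubelet started with a containerd endpoint, and images pulled with docker are not visible there, which is the same reason the pause image had to be imported with ctr. The sketch below extracts the endpoint from a kubelet command line. The function name runtime_endpoint is invented for this example.

```shell
# Hypothetical helper: read a kubelet command line on standard input and
# print the value of --container-runtime-endpoint, if present.
runtime_endpoint() {
  tr ' ' '\n' | sed -n 's/^--container-runtime-endpoint=//p'
}

# On the node:
#   ps -o args= -C kubelet | runtime_endpoint
# If this prints a containerd socket, pull or import into containerd instead:
#   ctr -n k8s.io images ls | grep kube-apiserver
```

Because the pull is performed by the container runtime on the host, Pod-level network settings should not affect it.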


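Is this error caused by my having installed Harbor on a different machine, so that the network is unreachable from inside the container?
How should it be handled?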


One suggestion is to start the Pod on the host network:

spec:
  hostNetwork: true

That still did not work, so I decided to try the online installation instead.

export KKZONE=cn
./kk create cluster --with-kubernetes v1.22.12 --with-kubesphere v3.4.1


As expected, this went much better than the offline installation. At least the Docker images were all created.


Console: http://172.16.21.35:30880
Account: admin
Password: P@88w0rd

Summary: the offline installation is too unreliable. It has many problems that are hard to solve, and the learning cost is too high.
