(GKE) Building a Production Environment on Kubernetes (III): Deploying Rook Ceph Distributed Storage

Posted at 2019-04-03

Background:
When you want to share data between pods, on AWS you can simply mount a shared NFS volume (Amazon EFS, not EBS, which is per-instance block storage), but on GCP a Persistent Disk can only be mounted read-write by one pod at a time. GCP does offer Cloud Filestore as a managed NFS service, but "The minimum Standard tier instance size is 1 terabyte (TB)", which makes it very expensive. Deploying Ceph distributed storage on GKE therefore became the option of choice, and since deploying Ceph by hand is quite tedious, we make use of the cloud-native Rook operator.
In addition, Taints/Tolerations and a node selector are used to deploy the Ceph cluster onto a designated node pool,
so that other pods cannot intrude on the Ceph cluster's nodes.
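For reference, the manifests below assume that the nodes of the dedicated pool carry the label role=app-node and the taint node-type=app-node:NoSchedule. A minimal sketch of preparing such a pool (the pool name ceph-pool is a hypothetical placeholder; on GKE the cloud.google.com/gke-nodepool label identifies a node pool):

# label and taint every node of the dedicated pool (pool name is hypothetical)
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=ceph-pool -o jsonpath='{.items[*].metadata.name}'); do
  kubectl label node "$node" role=app-node --overwrite
  kubectl taint node "$node" node-type=app-node:NoSchedule --overwrite
done

When creating the pool, the same can be set up front with gcloud container node-pools create ... --node-labels=role=app-node --node-taints=node-type=app-node:NoSchedule, which is also applied to nodes recreated later.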

Procedure:

If you have already read through the detailed part of this article, you can conveniently assemble everything with the commands below.

====================================================

1.
helm install --namespace rook-ceph-system --name rook-ceph rook-stable/rook-ceph -f values.yaml
kubectl --namespace rook-ceph-system get pods -l "app=rook-ceph-operator"
kubectl get po -n rook-ceph-system
2.
kubectl apply -f cluster.yaml
### If the STATUS of a rook-ceph-mon-* pod is CrashLoopBackOff, log in to the VM via ssh and run sudo rm -rf /var/lib/rook ###
kubectl get po -n rook-ceph -o wide
kubectl logs `kubectl get pod -n rook-ceph-system | grep -F -i 'rook-ceph-operator' | awk '{print $1}'` -n rook-ceph-system
3.
kubectl apply -f filesystem.yaml
### It takes about 3-5 minutes for the CephFilesystem pods to come up ###
kubectl get po -n rook-ceph

4.

cat <<EOF > testfs.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        volumeMounts:
        - name: nginx-data
          mountPath: /usr/share/nginx/html
      volumes:
      - name: nginx-data
        flexVolume:
          driver: ceph.rook.io/rook
          fsType: ceph
          options:
            fsName: myfs
            clusterNamespace: rook-ceph
EOF
kubectl apply -f testfs.yaml
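A quick way to confirm that the two replicas really share the same CephFS volume (pod names are looked up dynamically; this is a sketch):

POD1=$(kubectl get pod -l app=nginx -o jsonpath='{.items[0].metadata.name}')
POD2=$(kubectl get pod -l app=nginx -o jsonpath='{.items[1].metadata.name}')
# write a file through the first pod
kubectl exec $POD1 -- sh -c 'echo hello-from-ceph > /usr/share/nginx/html/index.html'
# the same content should be readable from the second pod
kubectl exec $POD2 -- cat /usr/share/nginx/html/index.html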

kubectl delete -f testfs.yaml
kubectl delete -f filesystem.yaml
kubectl delete -f cluster.yaml
kubectl -n rook-ceph patch cephclusters.ceph.rook.io rook-ceph -p '{"metadata":{"finalizers": []}}' --type=merge
kubectl delete ns rook-ceph
kubectl delete ns rook-ceph-system
helm del rook-ceph --purge

====================================================

The details are below.

I. Deploying the Rook Operator with Helm

cat <<EOF > values.yaml
nodeSelector:
  role: app-node
tolerations:
- key: node-type
  operator: Equal
  value: app-node
  effect: "NoSchedule"


agent:
  flexVolumeDirPath:
    /home/kubernetes/flexvolume
  toleration:
    "NoSchedule"
discover:
  toleration:
    "NoSchedule"
EOF
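If the rook-stable chart repository has not been registered with Helm yet, it has to be added before the install (repository URL as documented by Rook at the time):

helm repo add rook-stable https://charts.rook.io/stable
helm repo update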

helm install --namespace rook-ceph-system --name rook-ceph rook-stable/rook-ceph -f values.yaml

II. Deploying the Ceph Cluster with Node Affinity and Node Taints
The spec.placement section here (nodeAffinity and tolerations for osd (object storage daemon), mon, and mgr) controls which physical nodes these daemons are preferentially deployed onto; make sure it stays consistent with the node taints and the tolerations in values.yaml.
cluster.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: rook-ceph
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rook-ceph-osd
  namespace: rook-ceph
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rook-ceph-mgr
  namespace: rook-ceph
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: rook-ceph-osd
  namespace: rook-ceph
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: [ "get", "list", "watch", "create", "update", "delete" ]
---
# Aspects of ceph-mgr that require access to the system namespace
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: rook-ceph-mgr-system
  namespace: rook-ceph
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
---
# Aspects of ceph-mgr that operate within the cluster's namespace
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: rook-ceph-mgr
  namespace: rook-ceph
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - services
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - delete
- apiGroups:
  - ceph.rook.io
  resources:
  - "*"
  verbs:
  - "*"
---
# Allow the operator to create resources in this cluster's namespace
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: rook-ceph-cluster-mgmt
  namespace: rook-ceph
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: rook-ceph-cluster-mgmt
subjects:
- kind: ServiceAccount
  name: rook-ceph-system
  namespace: rook-ceph-system
---
# Allow the osd pods in this namespace to work with configmaps
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: rook-ceph-osd
  namespace: rook-ceph
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rook-ceph-osd
subjects:
- kind: ServiceAccount
  name: rook-ceph-osd
  namespace: rook-ceph
---
# Allow the ceph mgr to access the cluster-specific resources necessary for the mgr modules
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: rook-ceph-mgr
  namespace: rook-ceph
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rook-ceph-mgr
subjects:
- kind: ServiceAccount
  name: rook-ceph-mgr
  namespace: rook-ceph
---
# Allow the ceph mgr to access the rook system resources necessary for the mgr modules
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: rook-ceph-mgr-system
  namespace: rook-ceph-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rook-ceph-mgr-system
subjects:
- kind: ServiceAccount
  name: rook-ceph-mgr
  namespace: rook-ceph
---
# Allow the ceph mgr to access cluster-wide resources necessary for the mgr modules
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: rook-ceph-mgr-cluster
  namespace: rook-ceph
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: rook-ceph-mgr-cluster
subjects:
- kind: ServiceAccount
  name: rook-ceph-mgr
  namespace: rook-ceph
---
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    # The container image used to launch the Ceph daemon pods (mon, mgr, osd, mds, rgw).
    # v12 is luminous, v13 is mimic, and v14 is nautilus.
    # RECOMMENDATION: In production, use a specific version tag instead of the general v13 flag, which pulls the latest release and could result in different
    # versions running within the cluster. See tags available at https://hub.docker.com/r/ceph/ceph/tags/.
    image: ceph/ceph:v13.2.4-20190109
    # Whether to allow unsupported versions of Ceph. Currently only luminous and mimic are supported.
    # After nautilus is released, Rook will be updated to support nautilus.
    # Do not set to true in production.
    allowUnsupported: false
  # The path on the host where configuration files will be persisted. If not specified, a kubernetes emptyDir will be created (not recommended).
  # Important: if you reinstall the cluster, make sure you delete this directory from each host or else the mons will fail to start on the new cluster.
  # In Minikube, the '/data' directory is configured to persist across reboots. Use "/data/rook" in Minikube environment.
  dataDirHostPath: /var/lib/rook
  # set the amount of mons to be started
  mon:
    count: 3
    allowMultiplePerNode: true
  # enable the ceph dashboard for viewing cluster status
  dashboard:
    enabled: true
    # serve the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy)
    # urlPrefix: /ceph-dashboard
    # serve the dashboard at the given port.
    # port: 8443
    # serve the dashboard using SSL
    # ssl: true
  network:
    # toggle to use hostNetwork
    hostNetwork: false
  rbdMirroring:
    # The number of daemons that will perform the rbd mirroring.
    # rbd mirroring must be configured with "rbd mirror" from the rook toolbox.
    workers: 0
  # To control where various services will be scheduled by kubernetes, use the placement configuration sections below.
  # The example under 'all' would have all services scheduled on kubernetes nodes labeled with 'role=storage-node' and
  # tolerate taints with a key of 'storage-node'.
  placement:
    osd:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - app-node
      tolerations:
      - key: node-type
        operator: Equal
        value: app-node
        effect: "NoSchedule"
    all:
      tolerations:
      - key: node-type
        operator: Equal
        value: app-node
        effect: "NoSchedule"
# The above placement information can also be specified for mon, osd, and mgr components
#    mon:
#    osd:
#    mgr:
  resources:
# The requests and limits set here, allow the mgr pod to use half of one CPU core and 1 gigabyte of memory
#    mgr:
#      limits:
#        cpu: "500m"
#        memory: "1024Mi"
#      requests:
#        cpu: "500m"
#        memory: "1024Mi"
# The above example requests/limits can also be added to the mon and osd components
#    mon:
#    osd:
  storage: # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: false
    deviceFilter:
    location:
    config:
      # The default and recommended storeType is dynamically set to bluestore for devices and filestore for directories.
      # Set the storeType explicitly only if it is required not to use the default.
      # storeType: bluestore
      databaseSizeMB: "1024" # this value can be removed for environments with normal sized disks (100 GB or larger)
      journalSizeMB: "1024"  # this value can be removed for environments with normal sized disks (20 GB or larger)
      osdsPerDevice: "1" # this value can be overridden at the node or device level
# Cluster level list of directories to use for storage. These values will be set for all nodes that have no `directories` set.
#    directories:
#    - path: /rook/storage-dir
# Individual nodes and their config can be specified as well, but 'useAllNodes' above must be set to false. Then, only the named
# nodes below will be used as storage resources.  Each node's 'name' field should match their 'kubernetes.io/hostname' label.
#    nodes:
#    - name: "172.17.4.101"
#      directories: # specific directories to use for storage can be specified for each node
#      - path: "/rook/storage-dir"
#      resources:
#        limits:
#          cpu: "500m"
#          memory: "1024Mi"
#        requests:
#          cpu: "500m"
#          memory: "1024Mi"
#    - name: "172.17.4.201"
#      devices: # specific devices to use for storage can be specified for each node
#      - name: "sdb"
#      - name: "nvme01" # multiple osds can be created on high performance devices
#        config:
#          osdsPerDevice: "5"
#      config: # configuration can be specified at the node level which overrides the cluster level config
#        storeType: filestore
#    - name: "172.17.4.301"
#      deviceFilter: "^sd."

kubectl apply -f cluster.yaml
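To check that the Ceph daemons were actually scheduled onto the intended node pool (and nowhere else), the node labels/taints and the pod placement can be compared, for example:

# show the role label and the taints of each node
kubectl get nodes -L role
kubectl describe nodes | grep -A1 Taints
# show which node each mon/mgr/osd pod landed on
kubectl get po -n rook-ceph -o wide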

At this point, if a ceph-cluster has been deployed before, the following error appears:
The keyring does not match the existing keyring in /var/lib/rook/mon-a/data/keyring. You may need to delete the contents of dataDirHostPath on the host from a previous deployment.
Workaround:
Log in to the node over SSH from the GCP VM INSTANCES page and run sudo rm -rf /var/lib/rook.
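For example, with the gcloud CLI (NODE_NAME and ZONE are placeholders for the affected node and its zone):

gcloud compute ssh NODE_NAME --zone ZONE --command 'sudo rm -rf /var/lib/rook'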

III. Updating the Rook Operator
When you want to update the Rook Operator, run:

helm upgrade --namespace rook-ceph-system rook-ceph rook-stable/rook-ceph -f values.yaml
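After the upgrade, it is worth watching the operator pod until it has been recreated and is Running again:

kubectl --namespace rook-ceph-system get pods -l "app=rook-ceph-operator" -w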

IV. Provisioning a Filesystem with Node Affinity and Node Taints
filesystem.yaml

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  # The metadata pool spec
  metadataPool:
    replicated:
      # Increase the replication size if you have more than one osd
      size: 1
  # The list of data pool specs
  dataPools:
    - failureDomain: osd
      replicated:
        size: 1
  # The metadata service (mds) configuration
  metadataServer:
    # The number of active MDS instances
    activeCount: 1
    # Whether each active MDS instance will have an active standby with a warm metadata cache for faster failover.
    # If false, standbys will be available, but will not have a warm cache.
    activeStandby: true
    # The affinity rules to apply to the mds deployment

    placement:
      # nodeAffinity:
      #   requiredDuringSchedulingIgnoredDuringExecution:
      #     nodeSelectorTerms:
      #     - matchExpressions:
      #       - key: role
      #         operator: In
      #         values:
      #         - storage-node
      tolerations:
      - key: "node-type"
        operator: "Equal"
        value: "app-node"
        effect: "NoSchedule"
    #  tolerations:
    #  - key: mds-node
    #    operator: Exists
    #  podAffinity:
    #  podAntiAffinity:
    resources:
    # The requests and limits set here, allow the filesystem MDS Pod(s) to use half of one CPU core and 1 gigabyte of memory
    #  limits:
    #    cpu: "500m"
    #    memory: "1024Mi"
    #  requests:
    #    cpu: "500m"
    #    memory: "1024Mi"

kubectl apply -f filesystem.yaml
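After a few minutes the MDS pods of the filesystem should show up in the rook-ceph namespace; as far as I know Rook labels them with app=rook-ceph-mds, so they can be checked with:

kubectl -n rook-ceph get pod -l app=rook-ceph-mds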

V. Uninstalling Rook Ceph
The brute-force way:
0.kubectl delete -f filesystem.yaml
1.kubectl delete -f cluster.yaml
2.helm del rook-ceph --purge
3.kubectl delete daemonset --all -n rook-ceph-system
4.kubectl delete ns rook-ceph
5.kubectl delete crd cephclusters.ceph.rook.io
6.kubectl delete ns rook-ceph-system
After running step 4, kubectl get ns may show the rook-ceph namespace stuck in the Terminating status. In that case, run kubectl -n rook-ceph patch cephclusters.ceph.rook.io rook-ceph -p '{"metadata":{"finalizers": []}}' --type=merge.
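The namespace usually hangs because a finalizer is still set on the CephCluster resource (which is exactly what the patch command clears); it can be inspected with, for example:

kubectl -n rook-ceph get cephclusters.ceph.rook.io rook-ceph -o jsonpath='{.metadata.finalizers}'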

The somewhat cleaner way:
0.kubectl delete -f filesystem.yaml
1.kubectl delete -f cluster.yaml
2.kubectl delete ns rook-ceph-system
3.helm del rook-ceph --purge

Very useful reference pages

Rook
The official Rook Quick Start
