First of all, I’d rather not buy hardware blindly and then start testing the system and its configuration, so this post is about testing a single-node Ceph cluster in a virtual environment.
My Setup
I’m running everything on the following hardware configuration:
- Computing: Intel Core i7-12700, 16 GB DDR4 RAM @ 3200 MT/s
- Storage:
- 1x NVME 500GB disk
- 3x WD-RED 1TB HDD 5400 rpm (WDC WD10EFRX-68F) in RAID5 configuration (Intel RST + mdadm)
Host performances
Here are some approximate read/write speed measurements on the disks, done with the following commands:
sudo hdparm -Tt <disk> # read performance
dd if=/dev/zero of=test.bin bs=1M count=2048 conv=fsync status=progress # write performance
The HDD RAID speeds came out as:
READS:
Timing cached reads: 39396 MB in 2.00 seconds = 19732.09 MB/sec
Timing buffered disk reads: 862 MB in 3.01 seconds = 286.77 MB/sec
WRITES:
2048+0 records in
2048+0 records out
2147483648 bytes (2,1 GB, 2,0 GiB) copied, 21,1149 s, 102 MB/s
NVMe (writes only):
2048+0 records in
2048+0 records out
2147483648 bytes (2,1 GB, 2,0 GiB) copied, 1,51033 s, 1,4 GB/s
1 - Spinning up the VM
I used a virtual machine with 4 cores and 4096 MB of RAM allocated, running a Rocky Linux 9 minimal installation. For the storage configuration I’ve gone with the following:
- 1 boot disk (16 GB), qcow2, on my host boot disk (NVMe)
- 3 disks (8 GB each) on the mechanical storage, raw format, fully allocated, on a SATA bus (done for performance validation; qcow2 is SLOOOW)
- 2 disks (6 GB each) on the NVMe disk, raw format, fully allocated, on a SATA bus, for testing cache tiering performance
Base system setup
sudo dnf update
#Set hostname to ceph-test
sudo hostnamectl set-hostname ceph-test
Testing speeds
#Get disks and labels
[stefano@ceph-test ~]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 8G 0 disk
sdb 8:16 0 8G 0 disk
sdc 8:32 0 8G 0 disk
sdd 8:48 0 6G 0 disk
sde 8:64 0 6G 0 disk
sr0 11:0 1 1024M 0 rom
vda 252:0 0 16G 0 disk
├─vda1 252:1 0 1G 0 part /boot
└─vda2 252:2 0 15G 0 part
├─rl-root 253:0 0 13.4G 0 lvm /
└─rl-swap 253:1 0 1.6G 0 lvm [SWAP]
#Testing out speeds
[stefano@ceph-test ~]$ sudo dnf install hdparm
# Boot disk speed (vda)
[stefano@ceph-test ~]$ sudo hdparm -Tt /dev/vda
/dev/vda:
Timing cached reads: 39310 MB in 2.00 seconds = 19694.27 MB/sec
Timing buffered disk reads: 7938 MB in 3.00 seconds = 2645.92 MB/sec
[stefano@ceph-test ~]$ dd if=/dev/zero of=test.bin bs=1M count=1024 conv=fsync status=progress
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.836329 s, 1.3 GB/s
# SSDs
[stefano@ceph-test ~]$ sudo hdparm -Tt /dev/sde
/dev/sde:
Timing cached reads: 37702 MB in 2.00 seconds = 18887.34 MB/sec
Timing buffered disk reads: 4096 MB in 1.68 seconds = 2434.00 MB/sec
[stefano@ceph-test ~]$ sudo dd if=/dev/zero of=/dev/sde bs=1M count=1024 conv=fsync status=progress
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.16804 s, 919 MB/s
# HDDs
[stefano@ceph-test ~]$ sudo hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 39666 MB in 2.00 seconds = 19871.21 MB/sec
Timing buffered disk reads: 636 MB in 3.00 seconds = 211.75 MB/sec
[stefano@ceph-test ~]$ sudo dd if=/dev/zero of=/dev/sda bs=1M count=1024 conv=fsync status=progress
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.96485 s, 154 MB/s
As we can see, the speeds are in line with what we measured on the host.
Installing ceph
First of all, we should install some required dependencies:
sudo dnf install podman python3 lvm2 nano
We can now bootstrap our first (and only) node:
[stefano@ceph-test ~]$ curl --silent --remote-name --location https://github.com/ceph/ceph/raw/quincy/src/cephadm/cephadm
[stefano@ceph-test ~]$ chmod +x cephadm
[stefano@ceph-test ~]$ sudo ./cephadm add-repo --release quincy
...
Completed adding repo.
[stefano@ceph-test ~]$ sudo ./cephadm install
[stefano@ceph-test ~]$ sudo cephadm bootstrap --mon-ip 192.168.122.17 --allow-fqdn-hostname --skip-monitoring-stack --single-host-defaults
where 192.168.122.17 is the current host IP.
You can now access the control dashboard at https://<your ip>:8443/
Enter the ceph shell next:
[stefano@ceph-test ~]$ sudo cephadm shell
...
[ceph: root@ceph-test /]#
Building a standard pool and fs
Once we are in the shell, we can list all the available devices and create the osds:
[ceph: root@ceph-test /]# ceph orch device ls --refresh
HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS
ceph-test /dev/sda hdd ATA_QEMU_HARDDISK_QM00009 8589M Yes 11s ago
ceph-test /dev/sdb hdd ATA_QEMU_HARDDISK_QM00007 8589M Yes 11s ago
ceph-test /dev/sdc hdd ATA_QEMU_HARDDISK_QM00011 8589M Yes 11s ago
ceph-test /dev/sdd hdd ATA_QEMU_HARDDISK_QM00013 6442M Yes 11s ago
ceph-test /dev/sde hdd ATA_QEMU_HARDDISK_QM00015 6442M Yes 11s ago
#Tip: wait for full osd creation and start
[ceph: root@ceph-test /]# ceph orch daemon add osd ceph-test:/dev/sda
[ceph: root@ceph-test /]# ceph orch daemon add osd ceph-test:/dev/sdb
[ceph: root@ceph-test /]# ceph orch daemon add osd ceph-test:/dev/sdc
With the same method, create the other two osds and change their device class (if they are not recognized as ssd):
[ceph: root@ceph-test /]# ceph orch daemon add osd ceph-test:/dev/sdd
[ceph: root@ceph-test /]# ceph osd crush rm-device-class osd.3
[ceph: root@ceph-test /]# ceph osd crush set-device-class ssd osd.3
[ceph: root@ceph-test /]# ceph orch daemon add osd ceph-test:/dev/sde
[ceph: root@ceph-test /]# ceph osd crush rm-device-class osd.4
[ceph: root@ceph-test /]# ceph osd crush set-device-class ssd osd.4
Note that osd.3 and osd.4 are the ids ceph assigned to the newly created osds.
Testing out: hdd only replicated 2/2
First, create the replication rule and the pool:
ceph osd crush rule create-replicated replicated_hdd default osd hdd
ceph osd pool create replicated_no_cache_fs replicated replicated_hdd --autoscale-mode=on
ceph osd pool set replicated_no_cache_fs size 2
ceph osd pool set replicated_no_cache_fs min_size 2
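As a sanity check, the usable capacity of this pool can be estimated as raw HDD capacity divided by the replica count (a rough sketch; it ignores BlueStore overhead and the full-ratio safety margins):

```python
# Rough usable-capacity estimate for the replicated 2/2 HDD pool.
# Assumes the three 8 GB virtual HDDs from the lsblk output above.
hdd_sizes_gb = [8, 8, 8]
replica_size = 2

raw_gb = sum(hdd_sizes_gb)
usable_gb = raw_gb / replica_size  # every object is stored twice

print(f"raw: {raw_gb} GB, usable (size=2): {usable_gb:.0f} GB")
```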
Testing raw pool speed
[ceph: root@ceph-test /]# rados bench -p replicated_no_cache_fs 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ceph-test_475
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
...
14 4 44 40 11.4275 12 5.29316 4.54613
Total time run: 14.1894
Total writes made: 44
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 12.4036
Stddev Bandwidth: 4.35007
Max bandwidth (MB/sec): 20
Min bandwidth (MB/sec): 0
Average IOPS: 3
Stddev IOPS: 1.13873
Max IOPS: 5
Min IOPS: 0
Average Latency(s): 4.56899
Stddev Latency(s): 1.1601
Max latency(s): 6.0285
Min latency(s): 1.27842
12 MB/s… pretty crappy, that’s about 1/6 of USB 2.0.
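A back-of-the-envelope check on why it’s so slow (a sketch, assuming the ~102 MB/s RAID5 write figure measured on the host earlier): all three virtual HDDs share the same RAID5 spindles, and size=2 writes every object twice, so even before seek contention and BlueStore metadata the ceiling is low.

```python
# Naive upper bound for replica-2 writes on the shared RAID5 array.
raid5_write_mb_s = 102   # measured on the host with dd earlier
replica_size = 2

# Both replicas land on the same physical spindles, halving the budget.
upper_bound = raid5_write_mb_s / replica_size
print(f"naive ceiling: {upper_bound:.0f} MB/s")
```

Seek thrashing between the three virtual disks and journaling overhead account for the rest of the gap down to 12 MB/s.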
Testing out: hdd only erasure coded k=2 m=1
ceph osd erasure-code-profile set k2m1 k=2 m=1 crush-failure-domain=osd crush-device-class=hdd
ceph osd crush rule create-erasure erasure_hdd k2m1
ceph osd pool create data erasure k2m1 erasure_hdd --autoscale-mode=on
ceph osd pool set data allow_ec_overwrites true
Testing raw pool speed
[ceph: root@ceph-test /]# rados bench -p data 30 write
...
Total time run: 33.455
Total writes made: 121
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 14.4672
Stddev Bandwidth: 4.41416
Max bandwidth (MB/sec): 20
Min bandwidth (MB/sec): 0
Average IOPS: 3
Stddev IOPS: 1.12057
Max IOPS: 5
Min IOPS: 0
Average Latency(s): 4.23906
Stddev Latency(s): 0.678004
Max latency(s): 5.70135
Min latency(s): 1.37831
Cleaning up (deleting benchmark objects)
Removed 121 objects
Clean up completed and total clean up time :2.4025
14 MB/s… HOW? It should be slower… I think I’ll keep the 2/2 replication anyway, since EC is better suited to setups with more disks (like k=4 m=2).
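The storage-efficiency difference between the two layouts is easy to quantify: a replicated pool with size=2 stores every byte twice, while EC with k=2, m=1 stores 1.5 bytes per user byte. A quick sketch:

```python
# Fraction of raw capacity that holds user data, per layout.
def replicated_efficiency(size: int) -> float:
    return 1 / size

def ec_efficiency(k: int, m: int) -> float:
    return k / (k + m)

print(f"replicated size=2: {replicated_efficiency(2):.2f}")  # 0.50
print(f"EC k=2 m=1:        {ec_efficiency(2, 1):.2f}")       # 0.67
```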
Trying cache…
Caching pool 2/1
First of all, we need to create the cache pool and its CRUSH rule:
ceph osd crush rule create-replicated replicated_cache default osd ssd
ceph osd pool create cache replicated replicated_cache --autoscale-mode=on
ceph osd pool set cache size 2
ceph osd pool set cache min_size 1
Now we can add the cache tier
ceph osd tier add replicated_no_cache_fs cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay replicated_no_cache_fs cache
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 12
ceph osd pool set cache hit_set_period 14400
ceph osd pool set cache target_max_bytes 9600000000 # 0.8 * raw SSD capacity
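The 9600000000 figure comes out of 80% of the raw cache capacity (two 6 GB virtual SSDs, counting decimal gigabytes). A small sketch of the arithmetic, assuming you size against raw rather than replicated capacity:

```python
# target_max_bytes = 0.8 * raw cache capacity.
# Two 6 GB (6e9 byte) virtual SSDs back the cache pool here.
ssd_bytes = [6_000_000_000, 6_000_000_000]

target_max_bytes = int(0.8 * sum(ssd_bytes))
print(target_max_bytes)  # 9600000000
```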
A quick aside: testing the raw cache speed
[ceph: root@ceph-test /]# rados bench -p cache 10 write
...
Total time run: 10.1919
Total writes made: 1468
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 576.142
Stddev Bandwidth: 125.174
Max bandwidth (MB/sec): 664
Min bandwidth (MB/sec): 288
Average IOPS: 144
Stddev IOPS: 31.2936
Max IOPS: 166
Min IOPS: 72
Average Latency(s): 0.110211
Stddev Latency(s): 0.0341686
Max latency(s): 0.231888
Min latency(s): 0.053774
Cleaning up (deleting benchmark objects)
Removed 1468 objects
Clean up completed and total clean up time :0.234817
A good amount of CPU was used, and I’m pretty sure the process was CPU-bottlenecked.
Testing the final speed
[ceph: root@ceph-test /]# rados bench -p replicated_no_cache_fs 10 write
...
Total time run: 10.1919
Total writes made: 1468
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 576.142
Stddev Bandwidth: 125.174
Max bandwidth (MB/sec): 664
Min bandwidth (MB/sec): 288
Average IOPS: 144
Stddev IOPS: 31.2936
Max IOPS: 166
Min IOPS: 72
Average Latency(s): 0.110211
Stddev Latency(s): 0.0341686
Max latency(s): 0.231888
Min latency(s): 0.053774
Cleaning up (deleting benchmark objects)
Removed 1468 objects
Clean up completed and total clean up time :0.234817
Deleting pools
Pool deletion is disabled by default; enable it with:
ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
You can now delete the pool:
ceph osd pool delete <name> <name> --yes-i-really-really-mean-it
Configuring a file system
ceph osd pool create metadata 128 replicated replicated_hdd
ceph osd pool set metadata size 3
ceph osd pool set metadata min_size 2
ceph osd pool application enable data cephfs
ceph osd pool application enable metadata cephfs
ceph orch apply mds ceph-test
ceph orch apply nfs ceph-test
ceph fs new data_fs metadata data
Now you can deploy an NFS export from the dashboard and mount it on a client computer.
Testing speeds
sudo mount -t nfs 192.168.122.17:/ /mnt
sudo dd if=/dev/zero of=/mnt/data/test.bin bs=1M count=2048 conv=fsync status=progress
...
1206910976 bytes (1,2 GB, 1,1 GiB) copied, 4 s, 296 MB/s
1212153856 bytes (1,2 GB, 1,1 GiB) copied, 5 s, 237 MB/s
1217396736 bytes (1,2 GB, 1,1 GiB) copied, 6 s, 198 MB/s
1221591040 bytes (1,2 GB, 1,1 GiB) copied, 7 s, 174 MB/s
1226833920 bytes (1,2 GB, 1,1 GiB) copied, 8 s, 152 MB/s
1233125376 bytes (1,2 GB, 1,1 GiB) copied, 9 s, 137 MB/s
...
2048+0 records in
2048+0 records out
2147483648 bytes (2,1 GB, 2,0 GiB) copied, 119,608 s, 18,0 MB/s
As you can see, the copy starts out fast and the speed then decays: the initial burst lands in the client’s page cache and NFS buffers before the pool’s real throughput takes over.
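The numbers bear this out: the running rate drops from 296 MB/s toward the pool’s sustained throughput, and the overall average works out to the 18 MB/s that dd reports in its summary line:

```python
# Overall throughput of the NFS copy above: total bytes / total time.
total_bytes = 2147483648   # 2 GiB, from dd's summary
total_seconds = 119.608

mb_per_s = total_bytes / total_seconds / 1_000_000
print(f"{mb_per_s:.1f} MB/s")  # ~18 MB/s, matching dd
```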
Oh no, a disk is gone!
Simply delete the osd, swap the disk, and recreate it. Ceph will take care of moving and rewriting the missing pgs:
ceph osd out osd.0
ceph orch osd rm 0
ceph osd purge 0 --yes-i-really-mean-it
ceph orch daemon rm osd.0 --force
ceph orch device ls --refresh
ceph orch daemon add osd localhost.localdomain:/dev/sdf
Once the osd has been recreated, data will start to flow onto the new disk. Remember to change the device class if you’re in a virtual machine!
Expanding
Add a new osd and the CRUSH rules will do the math and move things around:
ceph orch daemon add osd localhost.localdomain:/dev/sdf
Another way of expanding is to remove an existing disk, wait for the cluster to rebalance, and then add a new, bigger disk. Ceph should handle this situation, but the bigger disk will receive more IOPS and hold more data. Make sure all the data on the bigger disk could still be moved to the others if it fails!
Update
I’m writing this update after some testing and this is what has gone wrong or needed some tweaks:
- I’ve expanded the VM’s RAM to 8 GB, because things can get pretty slow pretty fast when using caching and rebuilding.
- I also made a small tweak to the Ceph configuration from the web UI, setting osd_memory_target to 1 GB instead of the default 4 GB… this was needed because 5 osds × 4 GB = 20 GB > 8 GB, and things went bad once while testing expansion capabilities.
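The arithmetic behind that tweak, as a quick sketch:

```python
# Why the default osd_memory_target overcommits this VM.
num_osds = 5
vm_ram_gb = 8

default_target_gb = 4   # Ceph's default osd_memory_target
tuned_target_gb = 1     # value set from the dashboard

print(num_osds * default_target_gb)  # 20 GB demanded vs 8 GB available
print(num_osds * tuned_target_gb)    # 5 GB, leaving headroom for mon/mgr
```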
Additional resources
- https://blog.devgenius.io/ceph-install-single-node-cluster-55b21e6fdaa2
- https://balderscape.medium.com/setting-up-a-virtual-single-node-ceph-storage-cluster-d86d6a6c658e
- https://linoxide.com/hwto-configure-single-node-ceph-cluster/
- https://vineetcic.medium.com/how-to-remove-add-osd-from-ceph-cluster-1c038eefe522
- https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html/operations_guide/management-of-osds-using-the-ceph-orchestrator#removing-the-osd-daemons-using-the-ceph-orchestrator_ops
- https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/operations_guide/handling-a-disk-failure#replacing-a-failed-osd-disk_ops
- https://arpnetworks.com/blog/2019/06/28/how-to-update-the-device-class-on-a-ceph-osd.html
- https://docs.ceph.com/en/latest/rados/operations/erasure-code/
- https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ZJCYFAIUSPJGJFDIMVYOZ4K4AAM2BLL7/