Release notes for Gluster 3.12.0

This is a major Gluster release that includes, ability to mount sub-directories using the Gluster native protocol (FUSE), further brick multiplexing enhancements that help scale to larger brick counts per node, enhancements to gluster get-state CLI enabling better understanding of various bricks and nodes participation/roles in the cluster, ability to resolve GFID split-brain using existing CLI, easier GFID to real path mapping thus enabling easier diagnostics and correction for reported GFID issues (healing among other uses where GFID is the only available source for identifying a file), and other changes and fixes.

The most notable features and changes are documented on this page. A full list of bugs that have been addressed is included further below.

Further, as 3.11 release is a short term maintenance release, features included in that release are available with 3.12 as well, and could be of interest to users upgrading to 3.12 from older than 3.11 releases. The 3.11 release notes captures the list of features that were introduced with 3.11.

Major changes and features

Ability to mount sub-directories using the Gluster FUSE protocol

Notes for users:

With this release, it is possible define sub-directories to be mounted by specific clients and additional granularity in the form of clients to mount only that portion of the volume for access.

Until recently, Gluster FUSE mounts enabled mounting the entire volume on the client. This feature helps sharing a volume among the multiple consumers along with enabling restricting access to the sub-directory of choice.

Option controlling sub-directory allow/deny rules can be set as follows:

# gluster volume set <volname> auth.allow "/subdir1(192.168.1.*),/(192.168.10.*),/subdir2(192.168.8.*)"

How to mount from the client:

# mount -t glusterfs <hostname>:/<volname>/<subdir> /<mount_point>

Or,

# mount -t glusterfs <hostname>:/<volname> -osubdir_mount=<subdir> /<mount_point>

Limitations:

  • There are no throttling or QoS support for this feature. The feature will just provide the namespace isolation for the different clients.

Known Issues:

  • Once we cross more than 1000s of subdirs in 'auth.allow' option, the performance of reconnect / authentication would be impacted.

GFID to path conversion is enabled by default

Notes for users:

Prior to this feature, only when quota was enabled, did the on disk data have pointers back from GFID to their respective filenames. As a result, if there were a need to locate the path given a GFID, quota had to be enabled.

The change brought in by this feature, is to enable this on disk data to be present, for all cases, than just quota. Further, enhancements here have been to improve the manner of storing this information on disk as extended attributes.

The internal on disk xattr that is now stored to reference the filename and parent for a GFID is, trusted.gfid2path.<xxhash>

This feature is enabled by default with this release.

Limitations:

None

Known Issues:

None

Various enhancements have been made to the output of get-state CLI command

Notes for users:

The command #gluster get-state has been enhanced to output more information as below, - Arbiter bricks are marked more clearly in a volume that has the feature enabled - Ability to get all volume options (both set and defaults) in the get-state output - Rebalance time estimates, for ongoing rebalance, is captured in the get-state output - If geo-replication is configured, then get-state now captures the session details of the same

Limitations:

None

Known Issues:

None

Provided an option to set a limit on number of bricks multiplexed in a processes

Notes for users:

This release includes a global option to be switched on only if brick multiplexing is enabled for the cluster. The introduction of this option allows the user to control the number of bricks that are multiplexed in a process on a node. If the limit set by this option is insufficient for a single process, more processes are spawned for the subsequent bricks.

Usage:

#gluster volume set all cluster.max-bricks-per-process <value>

Provided an option to use localtime timestamps in log entries

Limitations:

Gluster defaults to UTC timestamps. glusterd, glusterfsd, and server-side glusterfs daemons will use UTC until one of, 1. command line option is processed, 2. gluster config (/var/lib/glusterd/options) is loaded, 3. admin manually sets localtime-logging (cluster.localtime-logging, e.g. #gluster volume set all cluster.localtime-logging enable).

There is no mount option to make the FUSE client enable localtime logging.

There is no option in gfapi to enable localtime logging.

Enhanced the option to export statfs data for bricks sharing the same backend filesystem

Notes for users: In the past 'storage/posix' xlator had an option called option export-statfs-size, which, when set to 'no', exports zero as values for few fields in struct statvfs. These are typically reflected in an output of df command, from a user perspective.

When backend bricks are shared between multiple brick processes, the values of these variables have been corrected to reflect field_value / number-of-bricks-at-node. Thus enabling better usage reporting and also enhancing the ability for file placement in the distribute translator when used with the option min-free-disk.

Provided a means to resolve GFID split-brain using the gluster CLI

Notes for users:

The existing CLI commands to heal files under split-brain did not handle cases where there was a GFID mismatch between the files. With the provided enhancement the same CLI commands can now address GFID split-brain situations based on the choices provided.

The CLI options that are enhanced to help with this situation are,

volume heal <VOLNAME> split-brain {bigger-file <FILE> |
    latest-mtime <FILE> |
    source-brick <HOSTNAME:BRICKNAME> [<FILE>]}

Limitations:

None

Known Issues:

None

Notes for developers:

NOTE: Also relevant for users building from sources and needing different defaults for some options

Most people consume Gluster in one of two ways: * From packages provided by their OS/distribution vendor * By building themselves from source

For the first group it doesn't matter whether configuration is done in a configure script, via command-line options to that configure script, or in a header file. All of these end up as edits to some file under the packager's control, which is then run through their tools and process (e.g. rpmbuild) to create the packages that users will install.

For the second group, convenience matters. Such users might not even have a script wrapped around the configure process, and editing one line in a header file is a lot easier than editing several in the configure script. This also prevents a messy profusion of configure options, dozens of which might need to be added to support a single such user's preferences. This comes back around as greater simplicity for packagers as well. This patch defines site.h as the header file for options and parameters that someone building the code for themselves might want to tweak.

The project ships one version to reflect the developers' guess at the best defaults for most users, and sophisticated users with unusual needs can override many options at once just by maintaining their own version of that file. Further guidelines for how to determine whether an option should go in configure.ac or site.h are explained within site.h itself.

Notes for developers:

Function gf_xxh64_wrapper has been added as a wrapper into libglusterfs for consumption by interested developers.

Reference to code can be found here

Notes for users:

glfs_ipc API was maintained as a public API in the GFAPI libraries. This has been removed as a public interface, from this release onwards.

Any application, written directly to consume gfapi as a means of interfacing with Gluster, using the mentioned API, would need to be modified to adapt to this change.

NOTE: As of this release there are no known public consumers of this API

Major issues

  1. Expanding a gluster volume that is sharded may cause file corruption
    • Sharded volumes are typically used for VM images, if such volumes are expanded or possibly contracted (i.e add/remove bricks and rebalance) there are reports of VM images getting corrupted.
    • The last known cause for corruption (Bug #1465123) has a fix with this release. As further testing is still in progress, the issue is retained as a major issue.
    • Status of this bug can be tracked here, #1465123

Bugs addressed

Bugs addressed since release-3.11.0 are listed below.

  • #1047975: glusterfs/extras: add a convenience script to label (selinux) gluster bricks
  • #1254002: [RFE] Have named pthreads for easier debugging
  • #1318100: RFE : SELinux translator to support setting SELinux contexts on files in a glusterfs volume
  • #1318895: Heal info shows incorrect status
  • #1326219: Make Gluster/NFS an optional component
  • #1356453: DHT: slow readdirp performance
  • #1366817: AFR returns the node uuid of the same node for every file in the replica
  • #1381970: GlusterFS Daemon stops working after a longer runtime and higher file workload due to design flaws?
  • #1400924: [RFE] Rsync flags for performance improvements
  • #1402406: Client stale file handle error in dht-linkfile.c under SPEC SFS 2014 VDA workload
  • #1414242: [whql][virtio-block+glusterfs]"Disk Stress" and "Disk Verification" job always failed on win7-32/win2012/win2k8R2 guest
  • #1421938: systemic testing: seeing lot of ping time outs which would lead to splitbrains
  • #1424817: Fix wrong operators, found by coverty
  • #1428061: Halo Replication feature for AFR translator
  • #1428673: possible repeatedly recursive healing of same file with background heal not happening when IO is going on
  • #1430608: [RFE] Pass slave volume in geo-rep as read-only
  • #1431908: Enabling parallel-readdir causes dht linkto files to be visible on the mount,
  • #1433906: quota: limit-usage command failed with error " Failed to start aux mount"
  • #1437748: Spacing issue in fix-layout status output
  • #1438966: Multiple bricks WILL crash after TCP port probing
  • #1439068: Segmentation fault when creating a qcow2 with qemu-img
  • #1442569: Implement Negative lookup cache feature to improve create performance
  • #1442788: Cleanup timer wheel in glfs_fini()
  • #1442950: RFE: Enhance handleops readdirplus operation to return handles along with dirents
  • #1444596: [Brick Multiplexing] : Bricks for multiple volumes going down after glusterd restart and not coming back up after volume start force
  • #1445609: [perf-xlators/write-behind] write-behind-window-size could be set greater than its allowed MAX value 1073741824
  • #1446172: Brick Multiplexing :- resetting a brick bring down other bricks with same PID
  • #1446362: cli xml status of detach tier broken
  • #1446412: error-gen don't need to convert error string to int in every fop
  • #1446516: [Parallel Readdir] : Mounts fail when performance.parallel-readdir is set to "off"
  • #1447116: gfapi exports non-existing glfs_upcall_inode_get_event symbol
  • #1447266: [snapshot cifs]ls on .snaps directory is throwing input/output error over cifs mount
  • #1447389: Brick Multiplexing: seeing Input/Output Error for .trashcan
  • #1447609: server: fd should be refed before put into fdtable
  • #1447630: Don't allow rebalance/fix-layout operation on sharding enabled volumes till dht+sharding bugs are fixed
  • #1447826: potential endless loop in function glusterfs_graph_validate_options
  • #1447828: Should use dict_set_uint64 to set fd->pid when dump fd's info to dict
  • #1447953: Remove inadvertently merged IPv6 code
  • #1447960: [Tiering]: High and low watermark values when set to the same level, is allowed
  • #1447966: 'make cscope' fails on a clean tree due to missing generated XDR files
  • #1448150: USS: stale snap entries are seen when activation/deactivation performed during one of the glusterd's unavailability
  • #1448265: use common function iov_length to instead of duplicate code
  • #1448293: Implement FALLOCATE FOP for EC
  • #1448299: Mismatch in checksum of the image file after copying to a new image file
  • #1448364: limited throughput with disperse volume over small number of bricks
  • #1448640: Seeing error "Failed to get the total number of files. Unable to estimate time to complete rebalance" in rebalance logs
  • #1448692: use GF_ATOMIC to generate callid
  • #1448804: afr: include quorum type and count when dumping afr priv
  • #1448914: [geo-rep]: extended attributes are not synced if the entry and extended attributes are done within changelog roleover/or entry sync
  • #1449008: remove useless options from glusterd's volume set table
  • #1449232: race condition between client_ctx_get and client_ctx_set
  • #1449329: When either killing or restarting a brick with performance.stat-prefetch on, stat sometimes returns a bad st_size value.
  • #1449348: disperse seek does not correctly handle the end of file
  • #1449495: glfsheal: crashed(segfault) with disperse volume in RDMA
  • #1449610: [New] - Replacing an arbiter brick while I/O happens causes vm pause
  • #1450010: [gluster-block]:Need a volume group profile option for gluster-block volume to add necessary options to be added.
  • #1450559: Error 0-socket.management: socket_poller XX.XX.XX.XX:YYY failed (Input/output error) during any volume operation
  • #1450630: [brick multiplexing] detach a brick if posix health check thread complaints about underlying brick
  • #1450730: Add tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t to bad tests
  • #1450975: Fix on demand file migration from client
  • #1451083: crash in dht_rmdir_do
  • #1451162: dht: Make throttle option "normal" value uniform across dht_init and dht_reconfigure
  • #1451248: Brick Multiplexing: On reboot of a node Brick multiplexing feature lost on that node as multiple brick processes get spawned
  • #1451588: [geo-rep + nl]: Multiple crashes observed on slave with "nlc_lookup_cbk"
  • #1451724: glusterfind pre crashes with "UnicodeDecodeError: 'utf8' codec can't decode" error when the --no-encode is used
  • #1452006: tierd listens to a port.
  • #1452084: [Ganesha] : Stale linkto files after unsuccessfuly hardlinks
  • #1452102: [DHt] : segfault in dht_selfheal_dir_setattr while running regressions
  • #1452378: Cleanup unnecessary logs in fix_quorum_options
  • #1452527: Shared volume doesn't get mounted on few nodes after rebooting all nodes in cluster.
  • #1452956: glusterd on a node crashed after running volume profile command
  • #1453151: [RFE] glusterfind: add --end-time and --field-separator options
  • #1453977: Brick Multiplexing: Deleting brick directories of the base volume must gracefully detach from glusterfsd without impacting other volumes IO(currently seeing transport end point error)
  • #1454317: [Bitrot]: Brick process crash observed while trying to recover a bad file in disperse volume
  • #1454375: ignore incorrect uuid validation in gd_validate_mgmt_hndsk_req
  • #1454418: Glusterd segmentation fault in ' _Unwind_Backtrace' while running peer probe
  • #1454701: DHT: Pass errno as an argument to gf_msg
  • #1454865: [Brick Multiplexing] heal info shows the status of the bricks as "Transport endpoint is not connected" though bricks are up
  • #1454872: [Geo-rep]: Make changelog batch size configurable
  • #1455049: [GNFS+EC] Unable to release the lock when the other client tries to acquire the lock on the same file
  • #1455104: dht: dht self heal fails with no hashed subvol error
  • #1455179: [Geo-rep]: Log time taken to sync entry ops, metadata ops and data ops for each batch
  • #1455301: gluster-block is not working as expected when shard is enabled
  • #1455559: [Geo-rep]: METADATA errors are seen even though everything is in sync
  • #1455831: libglusterfs: updates old comment for 'arena_size'
  • #1456361: DHT : for many operation directory/file path is '(null)' in brick log
  • #1456385: glusterfs client crash on io-cache.so(__ioc_page_wakeup+0x44)
  • #1456405: Brick Multiplexing:dmesg shows request_sock_TCP: Possible SYN flooding on port 49152 and memory related backtraces
  • #1456582: "split-brain observed [Input/output error]" error messages in samba logs during parallel rm -rf
  • #1456653: nlc_lookup_cbk floods logs
  • #1456898: Regression test for add-brick failing with brick multiplexing enabled
  • #1457202: Use of force with volume start, creates brick directory even it is not present
  • #1457808: all: spelling errors (debian package maintainer)
  • #1457812: extras/hook-scripts: non-portable shell syntax (debian package maintainer)
  • #1457981: client fails to connect to the brick due to an incorrect port reported back by glusterd
  • #1457985: Rebalance estimate time sometimes shows negative values
  • #1458127: Upcall missing invalidations
  • #1458193: Implement seek() fop in trace translator
  • #1458197: io-stats usability/performance statistics enhancements
  • #1458539: [Negative Lookup]: negative lookup features doesn't seem to work on restart of volume
  • #1458582: add all as volume option in gluster volume get usage
  • #1458768: [Perf] 35% drop in small file creates on smbv3 on *2
  • #1459402: brick process crashes while running bug-1432542-mpx-restart-crash.t in a loop
  • #1459530: [RFE] Need a way to resolve gfid split brains
  • #1459620: [geo-rep]: Worker crashed with TypeError: expected string or buffer
  • #1459781: Brick Multiplexing:Even clean Deleting of the brick directories of base volume is resulting in posix health check errors(just as we see in ungraceful delete methods)
  • #1459971: posix-acl: Whitelist virtual ACL xattrs
  • #1460225: Not cleaning up stale socket file is resulting in spamming glusterd logs with warnings of "got disconnect from stale rpc"
  • #1460514: [Ganesha] : Ganesha crashes while cluster enters failover/failback mode
  • #1460585: Revert CLI restrictions on running rebalance in VM store use case
  • #1460638: ec-data-heal.t fails with brick mux enabled
  • #1460659: Avoid one extra call of l(get|list)xattr system call after use buffer in posix_getxattr
  • #1461129: malformed cluster.server-quorum-ratio setting can lead to split brain
  • #1461648: Update GlusterFS README
  • #1461655: glusterd crashes when statedump is taken
  • #1461792: lk fop succeeds even when lock is not acquired on at least quorum number of bricks
  • #1461845: [Bitrot]: Inconsistency seen with 'scrub ondemand' - fails to trigger scrub
  • #1462200: glusterd status showing failed when it's stopped in RHEL7
  • #1462241: glusterfind: syntax error due to uninitialized variable 'end'
  • #1462790: with AFR now making both nodes to return UUID for a file will result in georep consuming more resources
  • #1463178: [Ganesha]Bricks got crashed while running posix compliance test suit on V4 mount
  • #1463365: Changes for Maintainers 2.0
  • #1463648: Use GF_XATTR_LIST_NODE_UUIDS_KEY to figure out local subvols
  • #1464072: cns-brick-multiplexing: brick process fails to restart after gluster pod failure
  • #1464091: Regression: Heal info takes longer time when a brick is down
  • #1464110: [Scale] : Rebalance ETA (towards the end) may be inaccurate,even on a moderately large data set.
  • #1464327: glusterfs client crashes when reading large directory
  • #1464359: selfheal deamon cpu consumption not reducing when IOs are going on and all redundant bricks are brought down one after another
  • #1465024: glusterfind: DELETE path needs to be unquoted before further processing
  • #1465075: Fd based fops fail with EBADF on file migration
  • #1465214: build failed with GF_DISABLE_MEMPOOL
  • #1465559: multiple brick processes seen on gluster(fs)d restart in brick multiplexing
  • #1466037: Fuse mount crashed with continuous dd on a file and reading the file in parallel
  • #1466110: dht_rename_lock_cbk crashes in upstream regression test
  • #1466188: Add scripts to analyze quota xattr in backend and identify accounting issues
  • #1466785: assorted typos and spelling mistakes from Debian lintian
  • #1467209: [Scale] : Rebalance ETA shows the initial estimate to be ~140 days,finishes within 18 hours though.
  • #1467277: [GSS] [RFE] add documentation on --xml and --mode=script options to gluster interactive help and man pages
  • #1467313: cthon04 can cause segfault in gNFS/NLM
  • #1467513: CIFS:[USS]: .snaps is not accessible from the CIFS client after volume stop/start
  • #1467718: [Geo-rep]: entry failed to sync to slave with ENOENT errror
  • #1467841: gluster volume status --xml fails when there are 100 volumes
  • #1467986: possible memory leak in glusterfsd with multiplexing
  • #1468191: Enable stat-prefetch in group virt
  • #1468261: Regression: non-disruptive(in-service) upgrade on EC volume fails
  • #1468279: metadata heal not happening despite having an active sink
  • #1468291: NFS Sub directory is getting mounted on solaris 10 even when the permission is restricted in nfs.export-dir volume option
  • #1468432: tests: fix stats-dump.t failure
  • #1468433: rpc: include current second in timed out frame cleanup on client
  • #1468863: Assert in mem_pools_fini during libgfapi-fini-hang.t on NetBSD
  • #1469029: Rebalance hangs on remove-brick if the target volume changes
  • #1469179: invoke checkpatch.pl with strict
  • #1469964: cluster/dht: Fix hardlink migration failures
  • #1470170: mem-pool: mem_pool_fini() doesn't release entire memory allocated
  • #1470220: glusterfs process leaking memory when error occurs
  • #1470489: bulk removexattr shouldn't allow removal of trusted.gfid/trusted.glusterfs.volume-id
  • #1470533: Brick Mux Setup: brick processes(glusterfsd) crash after a restart of volume which was preceded with some actions
  • #1470768: file /usr/lib64/glusterfs/3.12dev/xlator is not owned by any package
  • #1471790: [Brick Multiplexing] : cluster.brick-multiplex has no description.
  • #1472094: Test script failing with brick multiplexing enabled
  • #1472250: Remove fop_enum_to_string, get_fop_int usage in libglusterfs
  • #1472417: No clear method to multiplex all bricks to one process(glusterfsd) with cluster.max-bricks-per-process option
  • #1472949: [distribute] crashes seen upon rmdirs
  • #1475181: dht remove-brick status does not indicate failures files not migrated because of a lack of space
  • #1475192: [Scale] : Rebalance ETA shows the initial estimate to be ~140 days,finishes within 18 hours though.
  • #1475258: [Geo-rep]: Geo-rep hangs in changelog mode
  • #1475399: Rebalance estimate time sometimes shows negative values
  • #1475635: [Scale] : Client logs flooded with "inode context is NULL" error messages
  • #1475641: gluster core dump due to assert failed GF_ASSERT (brick_index < wordcount);
  • #1475662: [Scale] : Rebalance Logs are bulky.
  • #1476109: Brick Multiplexing: Brick process crashed at changetimerecorder(ctr) translator when restarting volumes
  • #1476208: [geo-rep]: few of the self healed hardlinks on master did not sync to slave
  • #1476653: cassandra fails on gluster-block with both replicate and ec volumes
  • #1476654: gluster-block default shard-size should be 64MB
  • #1476819: scripts: invalid test in S32gluster_enable_shared_storage.sh
  • #1476863: packaging: /var/lib/glusterd/options should be %config(noreplace)
  • #1476868: [EC]: md5sum mismatches every time for a file from the fuse client on EC volume
  • #1477152: [Remove-brick] Few files are getting migrated eventhough the bricks crossed cluster.min-free-disk value
  • #1477190: [GNFS] GNFS got crashed while mounting volume on solaris client
  • #1477381: Revert experimental and 4.0 features to prepare for 3.12 release
  • #1477405: eager-lock should be off for cassandra to work at the moment
  • #1477994: [Ganesha] : Ganesha crashes while cluster enters failover/failback mode
  • #1478276: separating attach tier and add brick
  • #1479118: AFR entry self heal removes a directory's .glusterfs symlink.
  • #1479263: nfs process crashed in "nfs3svc_getattr"
  • #1479303: [Perf] : Large file sequential reads are off target by ~38% on FUSE/Ganesha
  • #1479474: Add NULL gfid checks before creating file
  • #1479655: Permission denied errors when appending files after readdir
  • #1479662: when gluster pod is restarted, bricks from the restarted pod fails to connect to fuse, self-heal etc
  • #1479717: Running sysbench on vm disk from plain distribute gluster volume causes disk corruption
  • #1480448: More useful error - replace 'not optimal'
  • #1480459: Gluster puts PID files in wrong place
  • #1481931: [Scale] : I/O errors on multiple gNFS mounts with "Stale file handle" during rebalance of an erasure coded volume.
  • #1482804: Negative Test: glusterd crashes for some of the volume options if set at cluster level
  • #1482835: glusterd fails to start
  • #1483402: DHT: readdirp fails to read some directories.
  • #1483996: packaging: use rdma-core(-devel) instead of ibverbs, rdmacm; disable rdma on armv7hl
  • #1484440: packaging: /run and /var/run; prefer /run
  • #1484885: [rpc]: EPOLLERR - disconnecting now messages every 3 secs after completing rebalance
  • #1486107: /var/lib/glusterd/peers File had a blank line, Stopped Glusterd from starting
  • #1486110: [quorum]: Replace brick is happened when Quorum not met.
  • #1486120: symlinks trigger faulty geo-replication state (rsnapshot usecase)
  • #1486122: gluster-block profile needs to have strict-o-direct