DFS on a Diet: Enabling Reduction Schemes on Distributed File Systems
Time: Thursday, 19 November 2020, 8:30am - 5pm EST
Description: The choice of data reduction schemes, which is crucial for reducing data footprints on a distributed file system (DFS) and for transferring big data, is usually limited to the schemes supported by the underlying platform. If the platform's source code is available, users may be able to add their preferred reduction schemes, but doing so is costly to implement and in some cases virtually impossible. We propose a system design that links a DFS to reduction schemes and makes them transparent to data processing applications. We implemented this design within the Hadoop DFS (HDFS) as a framework named the Hadoop Data Reduction Framework (HDRF). HDRF has three main features: users can incorporate their preferred schemes at a reasonably restrained implementation cost; the selection of a scheme is transparent to data processing applications; and, in our experiments, HDRF has low processing and storage overhead and can halve the vanilla HDFS transfer time by using a more optimized application, without compromising the compression ratio.
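To make the idea of pluggable, application-transparent reduction concrete, the following is a minimal sketch in Python. It is not HDRF's code or API: the names `SCHEMES` and `ReducingStore` are hypothetical, and an in-memory dictionary stands in for the file system. It only illustrates the general pattern of registering a scheme once and letting callers read and write plain bytes.

```python
import lzma
import zlib
from typing import Callable, Dict, Tuple

# Hypothetical registry: scheme name -> (reduce, restore). Not HDRF's API.
Scheme = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]
SCHEMES: Dict[str, Scheme] = {
    "none": (lambda b: b, lambda b: b),
    "zlib": (zlib.compress, zlib.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


class ReducingStore:
    """Toy in-memory store that reduces data on write and restores it on read."""

    def __init__(self, scheme: str) -> None:
        self.scheme = scheme
        self._blocks: Dict[str, Tuple[str, bytes]] = {}

    def write(self, path: str, data: bytes) -> None:
        reduce_fn, _ = SCHEMES[self.scheme]
        # Store the scheme name with the data so that later reads do not
        # depend on whichever scheme is currently selected.
        self._blocks[path] = (self.scheme, reduce_fn(data))

    def read(self, path: str) -> bytes:
        scheme, stored = self._blocks[path]
        _, restore_fn = SCHEMES[scheme]
        return restore_fn(stored)

    def stored_size(self, path: str) -> int:
        return len(self._blocks[path][1])
```

In this sketch, adding a new scheme means adding one entry to `SCHEMES`; code that calls `write` and `read` never sees the reduced form, which is the sense of "transparent" assumed here.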