Submitted:
23 January 2024
Posted:
23 January 2024
Abstract
Keywords:
1. Introduction
- We identify the limitations of the current Linux readahead scheme when it is applied to SSD-based caching storage systems. These limitations motivate the design of a new readahead architecture for such systems.
- We propose a novel architecture that addresses the limitations of the current Linux readahead scheme. The new readahead architecture is cross-layered: it communicates with the VFS layer, the file system layer, and the block I/O layer to obtain information for prefetching. First, it communicates with the VFS layer to determine when to trigger a readahead, i.e., readahead timing. Second, it cooperates with the file system to determine whether the readahead data is continuous or fragmented, i.e., readahead data continuity. Finally, it queries the SSD cache manager to determine whether the accessed data is cached in the SSD or stored in the HDD, i.e., readahead data location. Then, according to the degree of data access sequentiality, the performance model of the corresponding storage device, and the data access patterns applied to that device, the proposed architecture determines whether to invoke prefetching and, if so, calculates an appropriate prefetch depth (also known as prefetch degree or prefetch size).
- We present a comprehensive design and implementation of our new architecture in Linux. The experimental results show that the architecture outperforms the current Linux prefetching scheme. In particular, it reduces total execution time by up to 49% compared with the stock Linux kernel.
2. Background
2.1. SSD-based Caching Storage Systems
2.2. Linux Readahead Scheme
- Cache miss page. When a user process issues a read request, Linux first searches the page cache. If the requested data is not in the page cache, i.e., this is a cache miss, the request is forwarded to the underlying storage device. Simultaneously, Linux performs a synchronous readahead that issues a much larger request than the original request to pull in more pages from the storage device.
- PG_readahead page. It is inefficient to trigger a prefetch only when a page cache miss occurs because the corresponding user process must be blocked until the requested data has been read, cached in the page cache, and copied to the user buffer. Based on the concept of asynchronous I/O, Linux addresses this using asynchronous readahead by proactively launching a readahead to the storage devices. Figure 2 shows that, in a sequential access stream, a synchronous readahead is followed by a number of asynchronous readaheads. Specifically, Linux introduces the async_size threshold. When the number of not-yet-referenced readahead pages falls below async_size, which means that the user process has consumed a sufficient number of readahead pages, an asynchronous readahead commences. A page is called a readahead page if it was fetched from the storage devices by the readahead algorithm. Thus, as shown in Figure 3, in a set of readahead pages, Linux sets a trigger page whose location is at a trigger distance of async_size from the end of the set of readahead pages. Linux marks the trigger page with the PG_readahead flag. When a user process accesses the page marked with PG_readahead, indicating that the readahead pages will soon be used up, Linux proactively issues an asynchronous readahead to pull in more pages. Thus, when users issue requests to read these pages, the requests are read immediately from the page cache instead of visiting the storage devices.
- Synchronous readahead. Depending on the size of the user read request, the readahead size is two or four times the user read size. Equation (1) calculates async_size:
- Asynchronous readahead. The readahead size is set to two or four times the previous readahead size, so it grows exponentially until it reaches the maximum readahead size, denoted by max_readahead. Equation (2) calculates async_size, setting it equal to the readahead size. For example, as shown in Figure 4(b), if the readahead size is 4 pages, then async_size = 4. Thus, if the upcoming read request is sequential, the next asynchronous readahead starts upon accessing the first readahead page.
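The window growth and trigger-page placement described above can be illustrated with a minimal sketch. It is not the kernel's code: the names `ReadaheadWindow`, `next_ra_size`, and `next_async_window` are hypothetical, and the thresholds that choose between the 4x and 2x growth factors are assumptions modeled loosely on the stock kernel, since the text only states that the factor is two or four. The trigger page is placed at a distance of async_size from the end of the window, and async_size equals the window size for asynchronous readaheads, as stated above.

```python
from dataclasses import dataclass


def next_ra_size(cur: int, max_readahead: int) -> int:
    """Grow the previous readahead size by 4x or 2x, capped at max_readahead."""
    if cur < max_readahead // 16:   # assumed threshold for 4x growth
        return 4 * cur
    if cur <= max_readahead // 2:   # assumed threshold for 2x growth
        return 2 * cur
    return max_readahead


@dataclass
class ReadaheadWindow:
    start: int       # page index of the first readahead page
    size: int        # number of pages in the window
    async_size: int  # trigger distance, measured from the end of the window

    def trigger_page(self) -> int:
        """Page index that would carry the PG_readahead mark."""
        return self.start + self.size - self.async_size


def next_async_window(prev: ReadaheadWindow, max_readahead: int) -> ReadaheadWindow:
    """Build the window that follows prev in a sequential stream."""
    size = next_ra_size(prev.size, max_readahead)
    # For an asynchronous readahead, async_size equals the readahead size.
    return ReadaheadWindow(start=prev.start + prev.size, size=size, async_size=size)


# Example matching the text: a 4-page window with async_size = 4.
w0 = ReadaheadWindow(start=0, size=4, async_size=4)
w1 = next_async_window(w0, max_readahead=32)
```

Because async_size equals the window size here, `w0.trigger_page()` is the first page of the window, so a sequential reader starts the next asynchronous readahead as soon as it touches that page.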
2.3. Linux I/O Stack
3. Motivations
- The performance models of HDDs and SSDs are drastically different. Unlike HDDs, SSDs have no mechanical moving parts, so they provide much better performance, especially for read operations. In addition, HDD sequential reads are much faster than HDD random reads. By contrast, in SSDs the performance of sequential reads and random reads is similar.
- The I/O workloads applied to HDDs and SSDs are also different. Since SSDs and HDDs have different performance models, an SSD cache manager should exploit the complementary properties of SSDs and HDDs. For example, bcache and profit caching differentiate random accesses from sequential accesses and only cache randomly accessed data in SSDs [8,11]. Therefore, data cached in SSDs exhibits different access patterns compared to data stored in HDDs.
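One way a cache manager might separate sequential from random accesses is to track the length of the current sequential run and stop caching once the run exceeds a cutoff. The sketch below illustrates that idea only; it is not bcache's or profit caching's implementation, and the class name `SequentialDetector` and its cutoff parameter are hypothetical.

```python
from typing import Optional


class SequentialDetector:
    """Tracks one I/O stream and flags long sequential runs."""

    def __init__(self, cutoff_bytes: int) -> None:
        self.cutoff_bytes = cutoff_bytes          # assumed tunable threshold
        self.next_offset: Optional[int] = None    # where a sequential request would start
        self.run_bytes = 0                        # length of the current sequential run

    def should_cache(self, offset: int, length: int) -> bool:
        """Return True if the request looks random enough to cache in the SSD."""
        if offset == self.next_offset:
            self.run_bytes += length              # request continues the current run
        else:
            self.run_bytes = length               # a jump starts a new run
        self.next_offset = offset + length
        return self.run_bytes < self.cutoff_bytes


detector = SequentialDetector(cutoff_bytes=8192)
decisions = [
    detector.should_cache(0, 4096),        # short run so far: cache
    detector.should_cache(4096, 4096),     # run reaches the cutoff: bypass the SSD
    detector.should_cache(100000, 4096),   # jump elsewhere: cache again
]
```

Under such a policy, long sequential streams stay on the HDD while scattered accesses accumulate in the SSD, which is why the two devices end up seeing different access patterns.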
4. Design and Implementation
4.1. System Architecture
- Readahead Timing. A readahead solution must know when to trigger a prefetch. As stated above, a readahead should be invoked when cache miss pages or PG_readahead pages are accessed.
- Readahead Data Continuity. A readahead solution must know whether the data blocks of a file are stored contiguously or are fragmented. If a file is fragmented, it must also know the logical block numbers (LBNs) of each file segment, so as to skip LBNs that belong to other files.
- Readahead Data Location. To apply different prefetching policies for SSDs and HDDs, a readahead solution must know whether an accessed data block is cached in the SSD or stored in the HDD.
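The continuity requirement can be made concrete with a small sketch that translates a range of file blocks into runs of contiguous LBNs, so that a prefetch never spans blocks belonging to other files. The `Extent` layout and the function `map_range` are hypothetical illustrations, not the interface the architecture uses to query the file system.

```python
from typing import List, NamedTuple, Tuple


class Extent(NamedTuple):
    file_block: int  # first file block covered by this extent
    lbn: int         # LBN where that file block is stored
    length: int      # number of blocks in the extent


def map_range(extents: List[Extent], start: int, count: int) -> List[Tuple[int, int]]:
    """Map file blocks [start, start + count) to (lbn, length) runs."""
    end = start + count
    runs: List[Tuple[int, int]] = []
    for ext in sorted(extents):                       # ordered by file_block
        lo = max(start, ext.file_block)
        hi = min(end, ext.file_block + ext.length)
        if lo >= hi:
            continue                                  # extent lies outside the range
        lbn = ext.lbn + (lo - ext.file_block)
        if runs and runs[-1][0] + runs[-1][1] == lbn:
            # Physically adjacent to the previous run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + (hi - lo))
        else:
            runs.append((lbn, hi - lo))
    return runs


fragmented = [Extent(0, 1000, 4), Extent(4, 5000, 4)]
contiguous = [Extent(0, 1000, 4), Extent(4, 1004, 4)]
frag_runs = map_range(fragmented, start=2, count=4)
cont_runs = map_range(contiguous, start=2, count=4)
```

A single run means the prefetch range is contiguous on the device; more than one run means the file is fragmented within that range and each run must be handled separately.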
4.2. ReqInterceptor
4.3. ReqAnalyzer
4.4. RaHandler
- Case 1. Data to be prefetched is all stored in the HDD. For the HDD, the additional cost of readahead is small because the seek time, which dominates the time to serve a disk request, is avoided. Furthermore, as stated in Section 2.1, in bcache, data located in the HDD is likely to have a high degree of sequentiality. In this case, RaHandler adopts an aggressive prefetching policy.
- Case 2. Data to be prefetched is all cached in the SSD. In this case, a read from flash memory can be completed in a few microseconds, compared to the HDD’s latency of several milliseconds. Thus, prefetching offers little benefit with SSDs because the performance gap between the SSD and main memory is small [29]. Furthermore, in bcache, SSD workloads tend to exhibit random access behavior. For these two reasons, RaHandler adopts a conservative prefetching policy.
- Case 3. Data to be prefetched is located in both the SSD and the HDD. This case also exhibits random access behavior. Thus, RaHandler adopts a conservative prefetching policy.
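The three cases reduce to a simple decision on where the blocks to be prefetched reside. The following sketch shows that decision only; the names `Location`, `Policy`, and `choose_policy` are hypothetical, and the sketch does not compute prefetch sizes.

```python
from enum import Enum
from typing import Iterable


class Location(Enum):
    HDD = "hdd"
    SSD = "ssd"


class Policy(Enum):
    AGGRESSIVE = "aggressive"
    CONSERVATIVE = "conservative"


def choose_policy(block_locations: Iterable[Location]) -> Policy:
    """Pick a prefetching policy from the locations of the blocks to prefetch."""
    locations = set(block_locations)
    if not locations:
        raise ValueError("no blocks to prefetch")
    if locations == {Location.HDD}:
        return Policy.AGGRESSIVE      # Case 1: all data stored in the HDD
    # Case 2 (all data in the SSD) and Case 3 (data in both devices)
    return Policy.CONSERVATIVE


case1 = choose_policy([Location.HDD, Location.HDD])
case2 = choose_policy([Location.SSD])
case3 = choose_policy([Location.SSD, Location.HDD])
```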
4.4.1. Synchronous Readahead
4.4.2. Asynchronous Readahead
4.4.3. Maximum Readahead Size
5. Experimental Study
5.1. Experimental Environment
- flexible IO tester (fio) is a workload simulation tool that simulates various types of I/O workloads [31].
- grep is a command-line utility that searches plain-text data sets for lines containing a match to a given regular expression. It visits files essentially in the order of the disk layout. We ran grep on two datasets: the source-code tree of Linux kernel 5.5.10, and a 100 MB text file. The latter was chosen because the Linux source-code tree mostly consists of small files. This utility can be beneficial in IoT systems for searching specific patterns.
- diff is a data comparison utility that calculates and displays line-by-line differences between two files. Instead of visiting files in the order of the disk layout, it scans files in alphabetical order of file names. The datasets used to generate grep workloads were also used for diff. This utility can simulate data comparison workloads in IoT systems.
- gcc is the GNU Compiler Collection, the standard compiler on many Unix-like operating systems. We ran the gcc benchmark by compiling and building the 5.5.10 Linux kernel. This workload is adopted to simulate application development on central servers within IoT systems.
- postmark is a representative and popular storage-related benchmarking tool [32] that simulates a workload composed of many short-lived, related small files typically seen in Internet applications such as web-based transaction servers.
- tpc-h is a popular decision support benchmark [33]. It comprises 22 business-oriented ad-hoc queries, each with a high degree of complexity and varied I/O characteristics. This benchmark is chosen to simulate workloads that involve examining large volumes of data, executing highly complex queries, and delivering answers in IoT systems.
5.2. Experimental Results
6. Related Work
6.1. HDD-based Prefetching Algorithms
6.2. SSD-based Prefetching Algorithms
6.3. Hint-based Prefetching Algorithms
7. Conclusion
References
- Al-Ali AR, Beheiry S, Alnabulsi A, Obaid S, Mansoor N, Odeh N, Mostafa A. An IoT-Based Road Bridge Health Monitoring and Warning System. Sensors. 2024; 24(2):469. [CrossRef]
- Barros N, Sobral P, Moreira RS, Vargas J, Fonseca A, Abreu I, Guerreiro MS. SchoolAIR: A Citizen Science IoT Framework Using Low-Cost Sensing for Indoor Air Quality Management. Sensors. 2024; 24(1):148. [CrossRef]
- Shaheen A, Kazim H, Eltawil M, Aburukba R. IoT-Based Solution for Detecting and Monitoring Upper Crossed Syndrome. Sensors. 2024; 24(1):135. [CrossRef]
- Elfaki AO, Messoudi W, Bushnag A, Abuzneid S, Alhmiedat T. A Smart Real-Time Parking Control and Monitoring System. Sensors. 2023; 23(24):9741. [CrossRef]
- R. Appuswamy, D. C. Moolenbroek, A. S. Tanenbaum, “Integrating flash-based SSDs into the storage stack,” in Proc. IEEE Mass Storage Syst. and Technol. Conf. (MSST), April 2012. [CrossRef]
- E. Shriver, C. Small, and K. A. Smith, “Why does file system prefetching work?” in Proc. USENIX Annu. Tech. Conf., 1999, pp. 71-84.
- dm-cache. https://web.archive.org/web/20140718083340/http://visa.cs.fiu.edu/tiki/dm-cache.
- bcache. https://bcache.evilpiepirate.org/.
- EnhanceIO. https://github.com/stec-inc/EnhanceIO.
- H. P. Chang and C. P. Chiang, “PARC: a novel OS cache manager,” Softw Pract Exper, Vol. 48, no. 12, pp. 2193-2222, 2018. [CrossRef]
- H. P. Chang, S. Y. Liao, D. W. Chang, G. W. Chen, “Profit data caching and hybrid disk-aware completely fair queuing scheduling algorithms for hybrid disks,” Softw Pract Exper, vol. 45, no. 9, pp. 1229-1249, 2015. [CrossRef]
- S. Ahmadian, R. Salkhordeh, and H. Asadi, “Lbica: A load balancer for i/o cache architectures,” in Proc. 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019, pp. 1196–1201. [CrossRef]
- L. Yin, L. Wang, Y. Zhang, and Y. Peng, “NUDTMapperX: adaptive metadata maintenance for fast crash recovery of dm-cache based hybrid storage devices,” in Proc. Usenix Annual Technical Conference (ATC), 2021.
- S. W. Schlosser, J. Schindler, S. Papadomanolakis, et al., “On multidimensional data and modern disks,” in Proc. the USENIX Conf. on File Storage Technol., 2005, pp. 225-238.
- R. J. Feiertag and E. I. Organick, “The multics input/output system,” in Proc. the 3rd ACM Symp. on Operating Systems Principles, Oct. 1971, pp. 35-41. [CrossRef]
- M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry, “A fast file system for unix,” ACM Trans. on Comput. Syst., vol. 2, no. 3, pp. 181-197, 1984. [CrossRef]
- T. Z. Teng and R. A. Gumaer, “Managing ibm database 2 buffers to maximize performance,” IBM Sys. J., vol. 23, no. 2, pp. 211-218, 1984. [CrossRef]
- P. Mead, “Oracle Rdb buffer management,” www.oracle.com/technology/products/rdb/pdf/2002tech forums/rdbtf 2002buffer.pdf, 2002.
- Jianwei Liao, “Server-side prefetching in distributed file systems,” Concurrency Computat: Pract Exper., vol.28, pp. 294-310, 2016. [CrossRef]
- Liao J, Gerofi B, Lien G-Y, et al., “A flexible I/O arbitration framework for netCDF-based big data processing workflows on high-end supercomputers,” Concurrency Computat: Pract Exper., vol.29, e4161, 2017. [CrossRef]
- Dong F, Zhou P, Liu Z, Shen D, Xu Z, Luo J, “Towards a fast and secure design for enterprise-oriented cloud storage systems,” Concurrency Computat: Pract Exper., vol. 29, e4177, 2017. [CrossRef]
- Y. Liang, R. Pan, Y. Du, C. Fu, L. Shi, T. W. Kuo, and C. J. Xue, “Read-Ahead Efficiency on Mobile Devices: Observation, Characterization, and Optimization,” IEEE Transactions on Computers, Vol. 70, No. 1, Jan. 2021. [CrossRef]
- G. Castets, P. Crowhurst, S. Garraway, and G. Rebmann, “IBM total storage enterprise storage server model 800,” IBM Redbook, 2002.
- F. Wu, H. Xi, J. Li, “Linux readahead: less tricks for more,” in Proc. the Linux Symp., 2007, pp. 273–284.
- P. Pai, B. Pulavarty, and M. Cao, “Linux 2.6 performance improvement through readahead optimization,” in Proc. the Linux Symp., 2004, pp. 391-402.
- F. Wu, H. Xi, and C. Xu, “On the design of a new linux readahead framework,” ACM SIGOPS Operat. Syst. Review, vol. 42, no. 5, pp. 75-84, 2008. [CrossRef]
- B. S. Gill and D. S. Modha, “SARC: sequential prefetching in adaptive replacement cache,” in Proc. the USENIX Annu. Tech. Conf., 2005, pp. 292-308.
- H.-P. Chang, C.-Y. Chen, C.-Y. Liu, “A prefetching scheme for multi-tiered storage systems,” in Proc. 15th IEEE International Conference on Advanced and Trusted Computing (ATC), 2018. [CrossRef]
- A. J. Uppal, R. C. Chiang, H. H. Huang, “Flashy prefetching for high-performance flash drives,” in Proc. IEEE 28th Symp. on Mass Storage Syst. and Technol. (MSST), 2012. [CrossRef]
- A. E. Papathanasiou and M. L. Scott, “Aggressive prefetching: an idea whose time has come,” in Proc. Conf. on Hot Topics in Operating Syst (HotOS), 2005, Art. no. 6.
- FIO benchmark. https://github.com/axboe/fio.
- Postmark, http://postmarkapp.com/.
- TPC-H. http://www.tpc.org/tpch/.
- Monetdb. https://www.monetdb.org/Home.
- C. Li, K. Shen, and A. E. Papathanasiou, “Competitive prefetching for concurrent sequential I/O,” in Proc. the ACM European Conf. on Comput. Syst. (EuroSys), March 2007, pp. 189-202. [CrossRef]
- S. Liang, S. Jiang, Z. Zhang, “STEP: sequentiality and thrashing detection based prefetching to improve performance of networked storage servers,” in Proc. Conf. Distributed Computing Syst. (ICDCS), July 2007. [CrossRef]
- A. D. Bathen and B. S. Gill, “AMP: adaptive multi-stream prefetching in a shared cache,” in Proc. 5th USENIX Conf. File and Storage Technol. (FAST), 2007, Art. no. 26.
- C. Li and K. Shen, “Managing prefetch memory for data-intensive online servers,” in Proc. the 4th USENIX Conf. File and Storage Technol., 2005, Art. no. 19.
- M. Li, E. Varki, S. Bhatia, and A. Merchant, “TaP: table-based prefetching for storage caches,” in Proc. USENIX Conf. File and Storage Technol., 2008, Art. no. 6.
- Z. Li, Z. Chen, and Y. Zhou, “Mining block correlations to improve storage performance,” ACM Trans. on Storage, vol. 1, no. 2, 2005, pp. 213-245. [CrossRef]
- G. Soundararajan, M. Mihailescu, C. Amza, “Context-aware prefetching at the storage server,” in Proc. USENIX Annu. Tech. Conf., 2008, pp. 377-390.
- J. Ryu, D. Lee, K. G. Shin, and K. Kang, "ClusterFetch: A lightweight prefetcher for intensive disk reads," IEEE Trans. Comput., vol. 67, no. 2, pp. 284-290, Feb. 2018. [CrossRef]
- S. Jiang, X. Ding, Y. Xu, and K. Davis, “A prefetching scheme exploiting both data layout and access history on disk,” ACM Trans. on Storage, vol. 9, no. 3, 2013, Art. no. 10. [CrossRef]
- G. Griffioen, “Performance measurements of automatic prefetching,” in Proc. Conf. on Parallel and Distributed Computing Syst., 1995, pp. 165-170.
- H. Lei and D. Duchamp, “An analytical approach to file prefetching,” in Proc. the USENIX Annu. Tech. Conf., Jan. 1997, pp. 21-32.
- T. M. Kroeger and D. E. Long, “Design and implementation of a predictive file prefetching algorithm,” in Proc. the USENIX Annu. Tech. Conf., 2001, pp. 105-118.
- G. Griffioen and R. Appleton, “Reducing file system latency using a predictive approach,” in Proc. the USENIX Summer Tech. Conf., Jun. 1994, pp. 13-23.
- K. M. Curewitz, P. Krishnan and J. S. Vitter, “Practical prefetching via data compression,” in Proc. ACM SIGMOD International Conf. on Management of Data (SIGMOD), 1993, pp. 257-266. [CrossRef]
- T. M. Kroeger and D. D. Long, “Predicting file system actions from prior events,” in Proc. USENIX Annu. Tech. Conf., Jan. 1996, pp. 26-35.
- Y. Joo, J. Ryu, S. Park, and K. G. Shin, “FAST: quick application launch on solid-state drives,” in Proc. USENIX Conf. on File and Storage Technol., Feb. 2011, Art. no. 19.
- A. Laga, J. Boukhobza, M. Koskas, and F. Singhoff, “Lynx: a learning linux prefetching mechanism for SSD performance model,” in Proc. Non-Volatile Memory Syst. and Applications Symp. (NVMSA), Aug. 2016, pp. 1-6. [CrossRef]
- P. Cao, E. W. Felten, A. R. Karlin, and K. Li, “Implementation and performance of integrated application-controlled file caching, prefetching, and disk scheduling,” ACM Trans. on Comput. Syst., vol. 14, pp. 311–343, 1996. [CrossRef]
- R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka, “ ,” in Proc. the 15th ACM Symp. on Operating Syst. Principles (SOSP), Dec. 1995, pp. 79-95. [CrossRef]
- F. Chang and G. A. Gibson, “Automatic I/O hint generation through speculative execution,” in Proc. 3rd USENIX Symp. on Operating Syst. Design and Implementation (OSDI), 1999, pp. 1-14.
- T. C. Mowry, A. K. Demke, and O. Krieger, “Automatic compiler-inserted i/o prefetching for out-of-core applications,” in Proc. the USENIX Symp. on Operating Syst. Design and Implementation (OSDI), Oct. 1996, pp. 3-17.
- S. W. Son, S. P. Muralidhara, O. Ozturk, M. Kandemir, I. Kolcu, M. Karakoy, “Profiler and compiler assisted adaptive I/O prefetching for shared storage caches,” in Proc. Conf. Parallel Architectures and Compilation Techniques (PACT), Oct. 2008, pp. 112-121. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).