Traversing large graphs on GPUs with unified memory

2020, Proceedings of the VLDB Endowment

https://doi.org/10.14778/3384345.3384358

Abstract

Due to the limited capacity of GPU memory, most prior work on graph applications on GPUs has been restricted to graphs of modest size that fit in memory. Recent hardware and software advances make it possible to address much larger host memory transparently through a feature known as unified virtual memory. While accessing host memory over an interconnect is understandably slower, the problem space has not been sufficiently explored for a challenging workload such as graph traversal, which has low computational intensity and an irregular data access pattern. We analyse the performance of breadth-first search (BFS) for several large graphs in the context of unified memory and identify the key factors that contribute to slowdowns. Next, we propose a lightweight offline graph reordering algorithm, HALO (Harmonic Locality Ordering), that can be used as a pre-processing step for static graphs. HALO yields speedups of 1.5x-1.9x over baseline in subsequent traversa...
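
To make the "irregular data access pattern" of graph traversal concrete, the following is a minimal CPU-side sketch of level-synchronous BFS over a graph stored in compressed sparse row (CSR) form. It is an illustration only, not the implementation evaluated in the paper: the function name `bfs_levels`, the array names `row_offsets` and `col_indices`, and the five-vertex example graph are all assumptions introduced here.

```python
from typing import List


def bfs_levels(row_offsets: List[int], col_indices: List[int], source: int) -> List[int]:
    """Level-synchronous BFS on a CSR graph.

    Returns the hop distance from `source` for every vertex, or -1 if the
    vertex is unreachable. Names and layout are illustrative assumptions.
    """
    num_vertices = len(row_offsets) - 1
    level = [-1] * num_vertices
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            # The neighbour list of u is one contiguous slice of col_indices.
            for v in col_indices[row_offsets[u]:row_offsets[u + 1]]:
                # level[v] is a scattered lookup: v can be any vertex ID, so
                # consecutive reads may land far apart in memory.
                if level[v] == -1:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level


# Hypothetical example graph: edges 0->1, 0->2, 1->3, 2->3; vertex 4 is isolated.
row_offsets = [0, 2, 3, 4, 4, 4]
col_indices = [1, 2, 3, 3]
levels = bfs_levels(row_offsets, col_indices, 0)
print(levels)
```

The per-vertex work is a single comparison and store, while each neighbour lookup may touch a different region of the `level` array. A vertex ordering that places vertices visited together close in memory would plausibly reduce the number of distinct regions touched per level, which is the kind of effect a reordering pre-processing step aims for.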
