A novel approach to generating tiled code for arbitrarily nested loops is presented. It combines the polyhedral and iteration space slicing frameworks. Instead of program transformations represented by a set of affine functions, one for each statement, it uses the transitive closure of a loop nest dependence graph to correct the original rectangular tiles so that all dependences of the original loop nest are preserved under the lexicographic order of the target tiles. Parallel tiled code can then be generated from valid serial tiled code by applying affine transformations or transitive closure to an inter-tile dependence graph, whose vertices represent target tiles and whose edges connect dependent target tiles. We demonstrate how a relation describing such a graph can be formed. The main merit of the presented approach over well-known ones is that it does not require full permutability of loops to generate either serial or parallel tiled code, which enlarges the class of loop nests that can be tiled.
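The idea of correcting rectangular tiles with the transitive closure can be illustrated on a toy example. The sketch below is not taken from the text above: the loop nest, the sizes `N` and `B`, the helper names, and the specific set formulas used for the correction are all assumptions chosen for illustration, and the closure is computed by brute-force search over an explicit finite iteration space rather than symbolically. For each rectangular tile it removes iterations that depend on lexicographically greater tiles and pulls in dependent iterations from lexicographically lesser tiles.

```python
from itertools import product

N, B = 4, 2  # hypothetical problem size and tile size

# Iteration space of an assumed loop nest:
#   for i in range(N): for j in range(N): a[i][j] = a[i-1][j+1]
space = set(product(range(N), range(N)))

# Dependence relation R: iteration (i, j) must run before (i+1, j-1).
R = {p: {(p[0] + 1, p[1] - 1)} & space for p in space}


def closure_image(sources):
    """Image of `sources` under the transitive closure R+ (brute force)."""
    seen, stack = set(), list(sources)
    while stack:
        for q in R[stack.pop()]:
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def tile_id(p):
    """Identifier of the rectangular tile that contains iteration p."""
    return (p[0] // B, p[1] // B)


# Original rectangular tiles; tuple comparison gives lexicographic order.
tile_ids = sorted({tile_id(p) for p in space})
rect = {t: {p for p in space if tile_id(p) == t} for t in tile_ids}


def corrected_tiles():
    """Build target tiles whose lexicographic order respects R (assumed formulas)."""
    target = {}
    for t in tile_ids:
        lt = set().union(*[rect[u] for u in tile_ids if u < t])
        gt = set().union(*[rect[u] for u in tile_ids if u > t])
        bad = closure_image(gt)                  # depend on a later tile
        kept = rect[t] - bad                     # safe to run in tile t
        moved = (closure_image(kept) & lt) - bad  # pulled in from earlier tiles
        target[t] = kept | moved
    return target


def is_valid(tiles):
    """True if no dependence goes from a later tile to an earlier one."""
    owner = {p: t for t, pts in tiles.items() for p in pts}
    return all(owner[p] <= owner[q] for p in space for q in R[p])


if __name__ == "__main__":
    print("rectangular tiles valid:", is_valid(rect))
    print("corrected tiles valid:  ", is_valid(corrected_tiles()))
```

In this toy case the dependence vector (1, -1) makes the plain rectangular tiling invalid, while the corrected tiles still cover every iteration exactly once and respect all dependences. A practical implementation would presumably represent the sets and relations symbolically with a polyhedral library instead of enumerating points.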