Academia.edu

Compute Unified Device Architecture NVIDIA CUDA

898 papers
15,599 followers
About this topic
Compute Unified Device Architecture (CUDA) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It enables developers to utilize the power of NVIDIA GPUs for general-purpose processing, allowing for the execution of complex computations across multiple cores, thereby enhancing performance in various computational tasks.

Key research themes

1. How can CUDA optimize parallel image convolution computations to enhance GPU performance?

This theme investigates CUDA implementations for image convolution, a fundamental operation in image processing, focusing on maximizing parallelism, efficient shared memory usage, and reducing idle threads to exploit GPU resources fully. Efficient convolution enhances performance in diverse fields such as computer vision, medical imaging, and graphics.

Key finding: The paper presents a CUDA approach leveraging separable filters to reduce convolution complexity from O(n*m) multiplications to O(n+m), significantly improving performance. It highlights optimized shared memory usage with an... Read more
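The separable-filter idea in the finding above can be illustrated with a minimal CPU-side sketch. This is plain Python, not the paper's CUDA kernel; the function names `conv2d_full` and `conv2d_separable` are hypothetical, and "valid" borders (no padding) are an assumption. When an n-by-m kernel is the outer product of a column vector and a row vector, one horizontal pass followed by one vertical pass gives the same result with roughly n + m multiplications per output pixel instead of n * m.

```python
def conv2d_full(image, kernel):
    """Direct 2D sliding-window sum: n*m multiplications per output pixel."""
    n, m = len(kernel), len(kernel[0])
    rows, cols = len(image) - n + 1, len(image[0]) - m + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(n) for j in range(m))
             for c in range(cols)]
            for r in range(rows)]


def conv2d_separable(image, col_vec, row_vec):
    """Same result for kernel = outer(col_vec, row_vec), using two 1D passes."""
    n, m = len(col_vec), len(row_vec)
    # Horizontal pass: m multiplications per intermediate pixel.
    tmp = [[sum(row_vec[j] * row[c + j] for j in range(m))
            for c in range(len(row) - m + 1)]
           for row in image]
    # Vertical pass: n multiplications per output pixel.
    return [[sum(col_vec[i] * tmp[r + i][c] for i in range(n))
             for c in range(len(tmp[0]))]
            for r in range(len(tmp) - n + 1)]


if __name__ == "__main__":
    # Hypothetical 5x5 test image and a Sobel-style separable kernel.
    image = [[1, 2, 3, 4, 5],
             [5, 4, 3, 2, 1],
             [0, 1, 0, 1, 0],
             [2, 2, 2, 2, 2],
             [9, 8, 7, 6, 5]]
    col_vec, row_vec = [1, 2, 1], [1, 0, -1]
    kernel = [[a * b for b in row_vec] for a in col_vec]
    print(conv2d_full(image, kernel) == conv2d_separable(image, col_vec, row_vec))
```

On a GPU, each of the two passes would typically map to its own kernel launch, with the intermediate rows or columns staged in shared memory; that mapping is not shown here.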

2. How can parallel GPU programming models, including CUDA, HIP, and OpenACC, be evaluated and optimized for performance portability across heterogeneous GPU architectures?

This theme focuses on comparative analyses of GPU programming models in the CUDA ecosystem and beyond, emphasizing portability, ease of use, performance tuning, and compatibility with emerging GPU architectures such as AMD Instinct GPUs. It explores tools and methodologies for porting CUDA code to HIP and other models, and benchmarks performance trade-offs and compiler toolchains, which is important for developing scalable HPC applications on increasingly diverse GPU hardware.

Key finding: IPMACC translates OpenACC directives into CUDA or OpenCL code, enabling evaluation of OpenACC’s expressiveness and performance against finely optimized CUDA implementations. By compiling OpenACC to CUDA, this work exposes the... Read more
Key finding: This study comprehensively evaluates several GPU programming models—including CUDA, HIP, OpenMP offloading, hipSYCL, Kokkos, and Alpaka—on NVIDIA and AMD GPUs (e.g., NVIDIA V100/A100 and AMD MI100). It demonstrates that HIP... Read more

3. What methods improve software intellectual property protection and enable reverse engineering analysis for CUDA applications?

This research cluster explores software protection and forensic reverse engineering techniques for compiled CUDA binaries. Considering NVIDIA’s CUDA binary formats and compiler behavior, the works analyze static and dynamic reverse engineering strategies and propose best practices for securing CUDA code against intellectual property theft and unauthorized code analysis, which is critical for software developers deploying proprietary algorithms on GPUs.

Key finding: The authors reveal that default compilation settings of NVIDIA’s CUDA compiler inadvertently facilitate reverse engineering by leaking significant information. By analyzing binary formats and employing static and dynamic... Read more

4. How can GPU-accelerated frameworks like RAPIDS and multi-GPU CUDA programming improve data-parallel machine learning workloads?

This theme examines integrating GPU-accelerated libraries with CUDA-enabled hardware to optimize parallel machine learning workflows. It focuses on leveraging multi-GPU distributed training, data parallelism, and pipeline parallelism through frameworks like RAPIDS and DASK, quantifying scalability, speedups, and communication overhead to demonstrate practical improvements in big data and AI applications.

Key finding: Demonstrating multi-GPU acceleration using RAPIDS with DASK for distributed data-parallel training, this work shows significant scalability (parallel fraction of 98.7%) and speedup with low serialization overhead (Karp-Flatt... Read more
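The two scalability measures named in the finding above can be sketched as follows. This is a minimal illustration, not the authors' analysis: the function names are hypothetical, the worker counts are assumed for illustration, and only the 98.7% parallel fraction comes from the text. Amdahl's law gives the ideal speedup for a given parallel fraction, and the Karp-Flatt metric recovers the experimentally determined serial fraction e = (1/S - 1/p) / (1 - 1/p) from a measured speedup S on p workers.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Ideal speedup under Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def karp_flatt(speedup, workers):
    """Experimentally determined serial fraction from a measured speedup."""
    if workers < 2:
        raise ValueError("Karp-Flatt needs at least two workers")
    return (1.0 / speedup - 1.0 / workers) / (1.0 - 1.0 / workers)


if __name__ == "__main__":
    parallel_fraction = 0.987  # value quoted in the finding above
    for workers in (2, 4, 8):  # hypothetical GPU counts, for illustration only
        s = amdahl_speedup(parallel_fraction, workers)
        e = karp_flatt(s, workers)
        print(f"workers={workers} speedup={s:.3f} serial_fraction={e:.4f}")
```

When the speedup follows Amdahl's law exactly, the Karp-Flatt value equals the serial fraction (here 1 - 0.987 = 0.013) for every worker count. In real measurements, a Karp-Flatt value that grows with the number of workers points to communication or serialization overhead rather than inherently serial code.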

All papers in Compute Unified Device Architecture NVIDIA CUDA

Complex systems research has long faced the fundamental challenge of disciplinary isolation and lack of unified mathematical language. This paper starts from a classic constrained dynamics problem, whose analytical solution reveals a... more
First, as is my custom, I thank God, whatever form He may take, for existing. I thank my parents and my sister for knowing me well enough to entrust me with my work and not with the responsibility of taking care of... more
Applications in distribution logistics are diverse, for example in planning the transport and delivery of goods or in routing data in telecommunications networks. Given the breadth and reach of these problems, studies have been... more
Bio-inspired techniques such as Genetic Algorithms are widely applicable to optimization problems. Given the ease of parallel implementation inherent in these techniques, several studies have been developed in this area... more
Graphics Processing Units (GPUs), originally developed for computer games, now provide computational power for scientific applications. A study on the comparison of computational speed-up and efficiency of a GPU with a CPU for the Finite... more
This paper presents an optical flow estimation technique to estimate the motion vectors in each frame of a video sequence. By thresholding and performing morphological closing on the motion vectors, we produce binary feature images. Using... more
The numerical solution of the ideal magnetohydrodynamics (MHD) equations is presented. The study addresses an unsteady problem. The technique introduced by Powell [1] is used. The eigenvectors are normalized to avoid problems during... more
In this work, after a brief historical overview of the development of option pricing theory, we show that the lack of tradability of stock options granted to executives and employees does not undermine the foundations... more
This paper describes a numerical method for the parallel solution of the differential measure inclusion problem posed by mechanical multibody systems containing bilateral and unilateral frictional constraints. The method proposed has been... more
The project was developed in collaboration with Professor D. Mari, Ospedale Maggiore Policlinico of Milan, the expert advice on anticoagulation in the elderly
A GPU based on the CUDA architecture developed by NVIDIA is a high-performance computing device. Multiplication of large matrices can be computed in a few seconds using such a GPU. A modern GPU consists of 16 highly... more
This report details the implementation of a snapshot-driven Artificial General Intelligence (AGI) agent that integrates environmental state processing, natural language understanding, and symbolic reasoning. The agent operates in the... more
Image segmentation consists of obtaining a region of interest within a larger area. GrabCut is a recent 2D segmentation technique that produces excellent results. It is based on representing the image as a flow network, and... more
The Uintah computational framework is used for the parallel solution of partial differential equations on adaptive mesh refinement grids using modern supercomputers. Uintah is structured with an application layer and a separate runtime... more
The rapid advancement of AI technologies has placed huge loads on data center architecture, creating the need for extremely resilient and fault-tolerant AI fabrics. This paper looks at AI design principles and... more
The association mining algorithm known as A-priori is one of the most popular data mining algorithms. It requires two parameters, support and confidence, to derive rules. The A-priori algorithm scans the... more
Breadth-first search (BFS) has wide applications in electronic design automation (EDA) as well as in other fields. Researchers have tried to accelerate BFS on the GPU, but the two published works are both asymptotically slower than the... more
Although the scientific computing community has questioned graphics processing unit (GPU) efficiency when it comes to energy per operation, the latest Green500 list (www.green500.org) should put to rest these concerns. The Green500 sorts... more
This paper introduces a library that helps programmers write parallel programs for GPU architectures, especially on systems with multiple GPUs. The library is designed around the idea of skeletons, which helps us to make... more
This study examines the integration of fundamental physical principles into three-dimensional computational algorithms and analyzes how digital systems simulate physical phenomena from the real world. Through a systematic analysis of... more
Quantum computing emerges as a field that captures a great theoretical interest. Its simulation represents a problem with high memory and computational requirements which makes advisable the use of parallel platforms. In this work we deal... more
The use of General-Purpose Graphics Processing Units (GPGPU) has grown considerably in recent years. One of the architectures that uses this concept is Nvidia's CUDA architecture, which achieves significant increases... more
Keywords: band-pass filter, photonic crystal, microstrip device.
We present a formal synthesis of recently proposed concepts in recursive prime theory and harmonic analysis with classical principles of number theory and fractal geometry. Building on the frameworks introduced in "The Hidden Simplicity... more
The performance and the degree of speedup obtained relative to the sequential solution are analyzed. As a reference, the results are compared with parallelization on a multicore cluster. The programming tools used were CUDA... more
We present an evaluation of the heterogeneous memory system of the Intel Xeon Phi KNL architecture, using applications with different characteristics. Applications that perform many data transfer operations from/to main memory, when... more
Computing the transitive closure of a graph is a problem that was first considered in 1959. Many sequential algorithms for solving this problem have been proposed, and parallel algorithms have been considered since... more
A collision involving only two bodies. -A collision scenario where more than two bodies are involved concurrently. -Concurrent collisions sharing at least one common rigid body number in their collision pair lists. -Collisions occurring... more
These methods consist of processing the spectrum of the images with a filter bank in order to extract the most informative features for the subsequent classification phase. Specifically,... more
Many areas of science benefit from the reduction in computation time achieved with Graphics Processing Units (GPUs). In epidemiology, this comes through speeding up the simulation of scenarios with large... more
Helicopters can experience brownout when flying close to a dusty surface. The uplifting of dust in the air can remarkably restrict the pilot’s visibility area. Consequently, a brownout can disorient the pilot and lead to the helicopter... more
The paper presents an overview of recent research on the Particle Swarm Optimization (PSO) algorithm parallelization on the Graphics Processing Unit for general-purpose computations (GPGPU). This survey attempts to collect, organize, and... more
In the rapidly evolving landscape of artificial intelligence, code generation has emerged as a critical frontier for automating software development, optimizing workflows, and democratizing programming expertise. Among the pioneers in... more
The current research paper aims to outline the capabilities of GP-GPU accelerated parallel computing for database operations and compare it with Central Processing Units (CPU)-based methodologies. The study evaluates the performance,... more
We propose a novel approach for runtime verification on computers with a large number of computation cores, without any hardware extension to the mainstream PC environment. The goal of the approach is to make use of all hardware resources to... more
This thesis examines the parallelization techniques that can be used to speed up econometric computations, specifically Monte Carlo simulations implemented in the R programming language. Parallelized Monte Carlo simulations can use... more
On the second weekend of October 1913, well over two thousand young men and women gathered on the Hoher Meißner in Hesse. The meeting took place at the invitation of a loose association of youth leagues and was meant to... more
The on-line event reconstruction in ALICE is performed by the High Level Trigger, which should process up to 2000 events per second in proton-proton collisions and up to 300 central events per second in heavy-ion collisions, corresponding... more
The polyhedral model is powerful for analyzing and transforming static control programs, hence its intensive use for the optimization of data locality and automatic parallelization. Affine transformations excel at modeling control flow,... more
In this age where data is growing at an astronomical rate, with unfettered access to digital information, complexities have been introduced to scientific computations, analysis, and inferences. This is because such data could not be... more
Due to the limited capacity of GPU memory, the majority of prior work on graph applications on GPUs has been restricted to graphs of modest sizes that fit in memory. Recent hardware and software advances make it possible to address much... more
In this paper, we develop a novel kind of energy-dissipation splitting finite-difference time-domain scheme for solving two-dimensional Maxwell equations with perfectly matched layers. The discrete energy dissipation law, convergence,... more
Engineering is often described as making practical application of the knowledge of pure sciences in the solution of a problem or the application of scientific and mathematical principles to develop economical solutions to technical... more
The paper discusses digital filter bank implementation using the weighted overlap-add (WOLA) algorithm modification introduced in the context of multichannel signal processing. The suggested modification is applied to hydroacoustic... more
Real-time depth extraction from stereo images is an important process in computer vision. This paper proposes a new implementation of the dynamic programming algorithm to calculate dense depth maps using the CUDA architecture achieving... more
This work aims to present the main features of the design and implementation of parallel algorithms for processing synthetic aperture radar (SAR) signals. It analyzes the reasons why the... more