Compute Unified Device Architecture NVIDIA CUDA

898 papers

15,599 followers

About this topic

Compute Unified Device Architecture (CUDA) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It enables developers to utilize the power of NVIDIA GPUs for general-purpose processing, allowing for the execution of complex computations across multiple cores, thereby enhancing performance in various computational tasks.

Key research themes

1. How can CUDA optimize parallel image convolution computations to enhance GPU performance?

This theme investigates CUDA implementations for image convolution, a fundamental operation in image processing, focusing on maximizing parallelism, efficient shared memory usage, and reducing idle threads to exploit GPU resources fully. Efficient convolution enhances performance in diverse fields such as computer vision, medical imaging, and graphics.

Image Convolution with CUDA

by mingming kong

2017

Key finding: The paper presents a CUDA approach leveraging separable filters to reduce convolution complexity from O(n*m) multiplications to O(n+m), significantly improving performance. It highlights optimized shared memory usage with an... Read more
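
The complexity reduction mentioned in this key finding can be illustrated with a small CPU-side sketch. It is not the paper's CUDA kernel: the function names `filter_2d` and `filter_separable` are hypothetical, zero padding at the image border is an assumption, and the indexing is correlation-style (no kernel flip). The point is only that an n×m kernel that factors into a column vector and a row vector can be applied in two 1D passes, costing n+m multiplications per pixel instead of n*m.

```python
def filter_2d(image, kernel):
    """Direct 2D filtering with zero padding: n*m multiplications per pixel."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - kh // 2, x + j - kw // 2
                    # Pixels outside the image are treated as zero.
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def filter_separable(image, col_vec, row_vec):
    """Same result for kernel = outer(col_vec, row_vec): n+m multiplications per pixel."""
    h, w = len(image), len(image[0])
    n, m = len(col_vec), len(row_vec)
    # Pass 1: horizontal 1D filter along each row.
    tmp = [[sum(image[y][x + j - m // 2] * row_vec[j]
                for j in range(m) if 0 <= x + j - m // 2 < w)
            for x in range(w)] for y in range(h)]
    # Pass 2: vertical 1D filter along each column of the intermediate image.
    return [[sum(tmp[y + i - n // 2][x] * col_vec[i]
                 for i in range(n) if 0 <= y + i - n // 2 < h)
             for x in range(w)] for y in range(h)]


if __name__ == "__main__":
    col, row = [1, 2, 1], [1, 0, -1]  # a Sobel-like separable kernel
    kernel = [[c * r for r in row] for c in col]
    image = [[(3 * y + 5 * x) % 7 for x in range(6)] for y in range(5)]
    print(filter_2d(image, kernel) == filter_separable(image, col, row))
```

In a GPU implementation each output pixel would typically be handled by one thread, with the intermediate row-pass result staged in shared memory; that mapping is omitted here.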

2. How can parallel GPU programming models, including CUDA, HIP, and OpenACC, be evaluated and optimized for performance portability across heterogeneous GPU architectures?

This theme focuses on comparative analyses of GPU programming models in the CUDA ecosystem and beyond, emphasizing portability, ease of use, performance tuning, and compatibility with emerging GPU architectures such as AMD Instinct GPUs. It explores tools and methodologies for porting CUDA code to HIP and other models, and for benchmarking performance trade-offs and compiler toolchains, which matters for developing scalable HPC applications on increasingly diverse GPU hardware.

IPMACC: Open Source OpenACC to CUDA/OpenCL Translator

by Amirali Baniasadi

2021

Key finding: IPMACC translates OpenACC directives into CUDA or OpenCL code, enabling evaluation of OpenACC’s expressiveness and performance against finely optimized CUDA implementations. By compiling OpenACC to CUDA, this work exposes the... Read more

Evaluating GPU Programming Models for the LUMI Supercomputer

by Michael Bussmann

2024, Supercomputing Frontiers

Key finding: This study comprehensively evaluates several GPU programming models—including CUDA, HIP, OpenMP offloading, hipSYCL, Kokkos, and Alpaka—on NVIDIA and AMD GPUs (e.g., NVIDIA V100/A100 and AMD MI100). It demonstrates that HIP... Read more

3. What methods improve software intellectual property protection and enable reverse engineering analysis for CUDA applications?

This research cluster explores software protection and forensic reverse engineering techniques for compiled CUDA binaries. Considering NVIDIA’s CUDA binary formats and compiler behavior, the works analyze static and dynamic reverse engineering strategies and propose best practices for securing CUDA code against intellectual property theft and unauthorized code analysis, a critical concern for software developers deploying proprietary algorithms on GPUs.

Strategies for Protecting Intellectual Property when Using CUDA Applications on Graphics Processing Units

by Xavier Bellekens and

2016

Key finding: The authors reveal that default compilation settings of NVIDIA’s CUDA compiler inadvertently facilitate reverse engineering by leaking significant information. By analyzing binary formats and employing static and dynamic... Read more

4. How can GPU-accelerated frameworks like RAPIDS and multi-GPU CUDA programming improve data-parallel machine learning workloads?

This theme examines integrating GPU-accelerated libraries with CUDA-enabled hardware to optimize parallel machine learning workflows. It focuses on leveraging multi-GPU distributed training, data parallelism, and pipeline parallelism through frameworks like RAPIDS and DASK, quantifying scalability, speedups, and communication overhead to demonstrate practical improvements in big data and AI applications.

Performance Analysis of Parallel Programs with RAPIDS as a Framework of Execution Easychair

by Seyi T O P E Ogunji

2024, EPiC Series in Computing Volume 104, 2024, Pages 243–267 Proceedings of 3rd International Workshop on Mathematical Modeling and Scientific Computing

Key finding: Demonstrating multi-GPU acceleration using RAPIDS with DASK for distributed data-parallel training, this work shows significant scalability (parallel fraction of 98.7%) and speedup with low serialization overhead (Karp-Flatt... Read more

All papers in Compute Unified Device Architecture NVIDIA CUDA

A Unified Architectural Theory of Complex Systems: From Dual Paradigms to Quantum Social Dynamics

by Fugang Tan

2025

Complex systems research has long faced the fundamental challenge of disciplinary isolation and lack of unified mathematical language. This paper starts from a classic constrained dynamics problem, whose analytical solution reveals a... more

Avaliação de técnicas de paralelização de algoritmos bioinspirados utilizando computação GPU: um estudo de casos para otimização de roteamento em redes ópticas

by Vincent Tadaiesky

2025

First of all, as is my custom, I thank God, whatever form he may take, for existing. I thank my parents and my sister for knowing me well enough to entrust me with my work and not with the responsibility of taking care of...

Avaliação de técnicas de paralelização de algoritmos bioinspirados utilizando computação GPU: um estudo de casos para otimização de roteamento em redes ópticas

by Vincent Tadaiesky

2025

Applications in distribution logistics are diverse, for example the planning of goods transport and delivery, or data routing in telecommunications networks. Given the breadth and reach of these problems, studies have been...

Runtime Performance Evaluation of GPU and CPU using a Genetic Algorithm Based on Neighborhood Model

by Vincent Tadaiesky

2025

Bio-inspired techniques such as Genetic Algorithms are widely applicable to optimization problems. Given how easily these techniques can be parallelized, several studies have been carried out in this area...

GPU computing for meshfree particle method

by Sudarshan Tiwari

2025, International Journal of Numerical Analysis and Modeling

Graphics Processing Units (GPUs), originally developed for computer games, now provide computational power for scientific applications. A study on the comparison of computational speed-up and efficiency of a GPU with a CPU for the Finite... more

Motion Analysis in Video Using Optical Flow Techniques

by Atish Khobragade

2025

This paper presents an optical flow estimation technique to estimate the motion vectors in each frame of a video sequence. By thresholding and performing morphological closing on the motion vectors, we produce binary feature images. Using...

Solución de las Ecuaciones de la Magnetohidrodinámica Ideal por Medio de un Esquema TVD

by Sergio ELASKAR

2025

The numerical solution of the ideal magnetohydrodynamics (MHD) equations is presented. The study corresponds to an unsteady problem. The technique introduced by Powell [1] is used. The eigenvectors are normalized to avoid problems during...

Valore soggettivo e oggettivo degli incentivi in forma di stock options

by Emilio Barone

2025, Working Paper

In this work, after a brief historical overview of the development of option pricing theory, it is shown that the lack of tradability of stock options granted to executives and employees does not undermine the foundations...

A Parallel Algorithm for Solving Complex Multibody Problems With Stream Processors

by Mihai Anitescu

2025, Volume 4: 7th International Conference on Multibody Systems, Nonlinear Dynamics, and Control, Parts A, B and C

This paper describes a numerical method for the parallel solution of the differential measure inclusion problem posed by mechanical multibody systems containing bilateral and unilateral frictional constraints. The method proposed has been... more

Network Based Simulation on HPC for Translational Medicine: an Application to Anticoagulation

by Davide Castaldi

2025

The project was developed in collaboration with Professor D. Mari, Ospedale Maggiore Policlinico of Milan, with expert advice on anticoagulation in the elderly

A Performance Criteria for parallel Computation on basis of block size using CUDA Architecture

by Ashis Dash

2025

A GPU based on the CUDA architecture developed by NVIDIA is a high-performance computing device. Multiplication of large matrices can be computed in a few seconds on such a GPU. A modern GPU consists of 16 highly...
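
Since this entry relates performance to block size, a short sketch of the launch-configuration arithmetic may help. It is not taken from the paper: `launch_config` is a hypothetical helper, and the limits of 1024 threads per block and 32 threads per warp are assumptions that hold on many current NVIDIA GPUs but should be checked against the actual device properties. The sketch shows how a chosen block shape determines the grid size for an n×n output matrix and how many launched threads fall outside the matrix and do no useful work.

```python
MAX_THREADS_PER_BLOCK = 1024  # assumption: common limit on current NVIDIA GPUs
WARP_SIZE = 32                # assumption: warp width on current NVIDIA GPUs


def launch_config(n, block_x, block_y):
    """Grid dimensions and idle-thread count for one thread per element of an n x n matrix."""
    threads_per_block = block_x * block_y
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        raise ValueError("block shape exceeds the assumed per-block thread limit")
    # Round up so that the grid covers the whole matrix.
    grid_x = (n + block_x - 1) // block_x
    grid_y = (n + block_y - 1) // block_y
    launched = (grid_x * block_x) * (grid_y * block_y)
    return {
        "grid": (grid_x, grid_y),
        "threads_per_block": threads_per_block,
        "warps_per_block": (threads_per_block + WARP_SIZE - 1) // WARP_SIZE,
        "idle_threads": launched - n * n,  # threads that fail the bounds check
    }


if __name__ == "__main__":
    for shape in [(8, 8), (16, 16), (32, 32)]:
        print(shape, launch_config(1000, *shape))
```

Block shapes whose thread count is a multiple of the warp size avoid partially filled warps, which is one reason 16×16 and 32×32 are common starting points when tuning.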

Snapshot-Driven AGI Agent with Language and Symbolic Integration

by Chinmay Kansara

2025

This report details the implementation of a snapshot-driven Artificial General Intelligence (AGI) agent that integrates environmental state processing, natural language understanding, and symbolic reasoning. The agent operates in the... more

3D GrabCut: Una segmentación de volúmenes basada en la técnica GrabCut utilizando la GPU

by Rhadamés Carmona

2025

Image segmentation consists of obtaining a region of interest within a larger area. GrabCut is a recent technique for 2D segmentation that presents excellent results. It is based on representing the image as a flow network, and...

Automatic Halo Management for the Uintah GPU-Heterogeneous Asynchronous Many-Task Runtime

by Brad Peterson

2025, International Journal of Parallel Programming

The Uintah computational framework is used for the parallel solution of partial differential equations on adaptive mesh refinement grids using modern supercomputers. Uintah is structured with an application layer and a separate runtime... more

Architecting highly resilient AI Fabrics: A Blueprint for Next-Gen Data Centers

by Oluwatosin Oladayo Aramide

2025, World Journal of Advanced Engineering Technology and Sciences

The rapid advancement of AI technologies has placed huge loads on data center architecture, creating the need for extremely resilient, fault-tolerant AI fabrics. This paper looks at AI design principles and...

Developing a-prior for fast mining association rules

by ghazijohnny J O H N N Y johnny

2025, Technical Magazine Iraq baghdad

The association rule mining algorithm known as A-priori is one of the most popular data mining algorithms. It requires two parameters, support and confidence, to derive rules. The A-priori algorithm scans the...

An effective GPU implementation of breadth-first search

by Wen-mei Hwu

2025, Proceedings of the 47th Design Automation Conference

Breadth-first search (BFS) has wide applications in electronic design automation (EDA) as well as in other fields. Researchers have tried to accelerate BFS on the GPU, but the two published works are both asymptotically slower than the... more
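
For readers unfamiliar with the frontier formulation that GPU BFS work usually starts from, here is a generic level-synchronous sketch. It is not the paper's implementation (the abstract above is truncated, so its method is not described here); `bfs_levels` and the adjacency-dictionary input are hypothetical choices for illustration. Each vertex enters the frontier at most once and each edge is examined at most once, so the total work is linear in the number of vertices plus edges.

```python
def bfs_levels(adjacency, source):
    """Return {vertex: BFS depth} for every vertex reachable from source."""
    levels = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # On a GPU, the vertices of the current frontier would be processed
        # in parallel; here they are visited sequentially.
        for u in frontier:
            for v in adjacency.get(u, ()):
                if v not in levels:  # first visit fixes the depth
                    levels[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return levels


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
    print(bfs_levels(graph, 0))
```

A parallel version additionally has to resolve races when two frontier vertices discover the same neighbour in the same level, for example with atomic operations; that detail is outside this sketch.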

EcoG: A Power-Efficient GPU Cluster Architecture for Scientific Computing

by Wen-mei Hwu

2025, Computing in Science & Engineering

Although the scientific computing community has questioned graphics processing unit (GPU) efficiency when it comes to energy per operation, the latest Green500 list (www.green500.org) should put to rest these concerns. The Green500 sorts...

An Intermediate Library for Multi-GPUs Computing Skeletons

by Đức Trung Lê

2025, 2012 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future

This paper introduces a library that helps programmers write parallel programs for GPU architectures, especially for systems with multiple GPUs. The library is built on the idea of skeletons, which helps us to make...

COMPUTATIONAL APPLICATION OF PHYSICAL LAWS IN 3D PROGRAMMING ENVIRONMENTS: AN ANALYSIS OF REAL-TIME SIMULATION SYSTEMS

by Elif Ceren Okur

2025

This study examines the integration of fundamental physical principles into three-dimensional computational algorithms and analyzes how digital systems simulate physical phenomena from the real world. Through a systematic analysis of... more

Семейство процессоров обработки сигналов с векторно-матричной архитектурой NeuroMatrix

by Павел Шевченко

2025

Quantum computer simulation using the CUDA programming model

by SERGIO ROBERTO AGUERO ROMERO

2025, Computer Physics Communications

Quantum computing has emerged as a field of great theoretical interest. Its simulation is a problem with high memory and computational requirements, which makes the use of parallel platforms advisable. In this work we deal...

Análise De Desempenho De Um Algoritmo De Esqueletização De Imagens Em Arquitetura Nvidia Cuda

by João Eduardo Tozzi de Souza

2025

The use of General-Purpose Graphics Processing Units (GPGPU) has grown considerably in recent years. One of the architectures built on this concept is Nvidia's CUDA architecture, which achieves significant gains...

Микрополосковый фильтр на двухмодовых резонаторах

by Б Беляев

2025, Актуальные проблемы авиации и космонавтики

Полосно-пропускающие фильтры на основе фотонных кристаллов

by Б Беляев

2025, Актуальные проблемы авиации и космонавтики

Keywords: band-pass filter, photonic crystal, microstrip device.

The Recursive Prime Harmonic Framework: Unifying 0-∞ Duality, Fractal Patterns, and Number Theory

by Richard Bolt

2025, WhatIf.Rocks

We present a formal synthesis of recently proposed concepts in recursive prime theory and harmonic analysis with classical principles of number theory and fractal geometry. Building on the frameworks introduced in "The Hidden Simplicity... more

Una implementación paralela de las Transformadas DCT y DST en GPU

by erica de sa

2025

The performance and the degree of speedup obtained relative to the sequential solution are analyzed. As a reference, the results are compared with parallelization on a multicore cluster. The programming tools used were CUDA...

Avaliação de desempenho do sistema de memória heterogênea da arquitetura Intel Knights Landing (KNL)

by Silvio Stanzani

2025

We present an evaluation of the heterogeneous memory system of the Intel Xeon Phi KNL architecture, using applications with different characteristics. Applications that perform many data transfer operations to and from main memory, when...

Implementações Paralelas para Fecho Transitivo

by Raphael de Aquino Gomes

2025

Computing the transitive closure of a graph is a problem that was first considered in 1959. Many sequential algorithms for solving it have been proposed, and parallel algorithms have been considered since...

Numerical simulation of systems of rigid bodies

by Rufus S Neethling

2025

- A collision involving only two bodies.
- A collision scenario where more than two bodies are involved concurrently.
- Concurrent collisions sharing at least one common rigid body number in their collision pair lists.
- Collisions occurring at exactly the same instant in time.
- A template class designed to be able to contain any type of data in a generic manner.
- A measure of the similarity between solution values in two successive iterations.
- Values obtained by successive iterations are tending towards specific values.
- Sets of standardised programming solutions to often encountered problems.
- Iteration results not settling on any specific values but rather appearing to change by ever greater values.
- A cluster collision involving seven rigid bodies.
- A cluster collision involving six rigid bodies.
- Repetitive calculation or algorithm execution that stops when a certain termination condition is met.
- A three-dimensional array of values or objects.
- A two-dimensional array of values or objects.
- A function providing a measure of the success or reliability of a solution to an optimisation problem when the solution is substituted into the expression.
- A cluster collision involving eight rigid bodies.
- A computational problem with the aim of finding an optimal or best solution for a set of requirements.
- If a vector is oriented exactly perpendicular to some reference vector, it is said to be orthogonal to that reference vector and it implies that their scalar product (dot product) should be zero.
- Profile drag coefficient tensor for body i.
- Tangential drag coefficient tensor for body i.
- Point i to facet j perpendicular projection distance.
- Inter-object distance between geometrical objects i and j.
- Infinitesimal volume for rigid body i.
- Force vector acting upon body i.
- Mass moment of inertia tensor for body i.
- Mass moment of inertia tensor entry in row j, column k for body i.
- Mass of body i.
- Mass tensor for body i.
- Surface normal vector for facet i.
- General contact normal vector between object i and j (pointing to i).
- Contact normal vector between object i and j (pointing to i) at contact number nb.
- Number of evenly spaced bins in Cartesian axis direction k.
- Plane normal vector for any arbitrary plane to be intersected by an edge.
- Orthonormal tangential vector between object i and j at contact number nb.
- Any arbitrary point position vector.
- Perpendicularly projected point from point i to edge j.
- Perpendicularly projected point from point i to geometric entity j.
- Perpendicularly projected point from facet i's vertex k to its opposite facet edge.
- Point on any arbitrary plane to be intersected by an edge.
- General point on an edge.
- Radius of spherical body i.
- Radius of spherical body i for contact number nb.
- Any radial position relative to a reference point in metres (m).
- Radial position vector from body i centre of mass to contact number nb.
- Component i of a general radius relative to a fixed point (d).
- General radius relative to the centre of mass for any body.
- Position radius of body i relative to the system centre of mass.
- Tangential vector between object i and j at contact number nb.
- Torque vector acting upon body i.
- Total volume of rigid body i.
- Centre of mass position vector for total rigid body system.
- Centre of mass position vector for body i.
- Position vector for first vertex of edge i.
- Position vector for second vertex of edge i.
- Position vector for first vertex of facet i.
- Position vector for second vertex of facet i.
- Position vector for third vertex of facet i.
- Position vector for vertex i.
- Any general linear velocity in metres per second (m·s⁻¹).
- Centre of mass average linear velocity for total rigid body system.
- Time derivative of centre of mass position vector for body i.
- Velocity of object i relative to object j.
- Normal component of velocity of object i relative to object j.
- Tangential component of velocity of object i relative to object j.
- Resultant velocity due to linear and angular velocities at given radius (r).
- Point i to edge j perpendicular projection ratio.
- Point-to-edge projection ratio from point i's perpendicular line k of facet j.
- Original sorting bin division size in Cartesian axis direction k.
- Sorting bin division overlap in Cartesian axis direction k.
- The time step adjustment to be made during a critical time search iteration process.
- The time step adjustment to be made during a critical time search iteration process.

Evaluación de algoritmos supervisados de extracción de características para clasificación de texturas

by Guillermo Ulises Cantero Miranda

2025

These methods consist of processing the spectrum of the images through a filter bank and, from there, extracting the features that provide the most information for the subsequent classification phase. Specifically, ...

Simulación De Enfermedades Infecciosas en Grandes Poblaciones a Través De Un Autómata Celular Estocástico Paralelizado Por Gpu Con C-Cuda

by Adrian E Trueba

2025

Many areas of science benefit from the reduction in computation time made possible by Graphics Processing Units (GPUs). Epidemiology benefits through faster simulation of scenarios with large...

HPC simulations of brownout: A noninteracting particles dynamic model

by Nicola Parolini

2025, The International Journal of High Performance Computing Applications

Helicopters can experience brownout when flying close to a dusty surface. The uplifting of dust in the air can remarkably restrict the pilot’s visibility area. Consequently, a brownout can disorient the pilot and lead to the helicopter... more

Survey on Particle Swarm Optimization accelerated on GPGPU

by Joanna Kołodziejczyk

2025, International Journal of Scientific and Engineering Research

The paper presents an overview of recent research on the Particle Swarm Optimization (PSO) algorithm parallelization on the Graphics Processing Unit for general-purpose computations (GPGPU). This survey attempts to collect, organize, and... more

Unlocking AI's Potential: The Secret Behind DeepSeek's Lightning-Fast Code Generation

by Flávio de J Ávila

2025

In the rapidly evolving landscape of artificial intelligence, code generation has emerged as a critical frontier for automating software development, optimizing workflows, and democratizing programming expertise. Among the pioneers in... more

Comparative assessment of GPU-Accelerated VS. CPU-based databases: Architecture, performance, and cost implications

by Arvind T

2025, International Journal of Cloud Computing and Database Management

The current research paper aims to outline the capabilities of GP-GPU accelerated parallel computing for database operations and compare it with Central Processing Units (CPU)-based methodologies. The study evaluates the performance,... more

Quantum computer simulation using the CUDA programming model

by Sergio Andrés Porras Romero

2025, Computer Physics Communications

KUDA: GPU Accelerated Split Race Checker

by Can Bekar

2025, Workshop on Determinism and Correctness in Parallel Programming (WoDet), London, England, UK

We propose a novel approach for runtime verification on computers with a large number of computation cores, without any hardware extension to the mainstream PC environment. The goal of the approach is to make use of all hardware resources to...

Speeding up Slow Monte Carlo Simulations using Parallel Computing Techniques

by Micha den Heijer

2025

This thesis examines the parallelization techniques that can be used to speed up econometric computations, specifically Monte Carlo simulations implemented in the R programming language. Parallelized Monte Carlo simulations can use...

Ludwig Klages und die Ethik der Erde im frühen 20. Jahrhundert, in: Sophia Gräfe / Georg Toepfer (Hg.): Wissensgeschichte des Verhaltens, Berlin: De Gruyter 2025, S. 65-80.

by Leander Scholz

2025

On the second weekend of October 1913, well over two thousand young men and women gathered on the Hoher Meißner in Hesse. The meeting took place at the invitation of a loose association of youth leagues and was meant to counter the patriotic excesses of the German Empire. For that October was also the scene of official ceremonies marking the centenary of the Battle of Leipzig (the Völkerschlacht), in which the Napoleonic troops had suffered their decisive defeat. Russia, Prussia, Austria, and Sweden had defeated France and thereby put an end to the hitherto successful campaign of conquest of its revolutionary army under Napoleon Bonaparte. On the occasion of this ceremonial look back at the historic event, the German Emperor and the federal princes were to inaugurate a monumental memorial, in eternal remembrance of the great battle of liberation of the European North against the European South, especially since a confrontation between hostile camps could be imminent again at any time. For the inauguration ceremony already stood under the sign of a war that would go down in history as the first of two world wars. Among the young opponents of imperial policy were the associations of the Wandervögel and the life reformers (Lebensreformer), who campaigned for fundamental political change. Their aim was to demonstrate the unity of a youth movement that had renounced and overcome the old enmities of the Gründerzeit. The alternative event to the imperial jubilee celebration was therefore conceived as a "Fest der Jugend" (Festival of Youth), with reference to the political awakening of the early Wartburg festivals in the nineteenth century, which had likewise been protest rallies of an emancipated youth rebelling against reactionary politics and the fragmentation into small states (Kleinstaaterei).
Although the "Fest der Jugend" also took place on the occasion of commemorating the military success of 1813, the youth leagues understood the victory not only as liberation from the French occupiers but also as liberation from German Kleinstaaterei and the old authorities. It marks the democratic beginning of a young nation that had yet to find itself. With the two competing events in October 1913, the German nation confronted itself in its defining features.

ALICE HLT High Speed Tracking on GPU

by Arshad Ahmad Masoodi

2025, IEEE Transactions on Nuclear Science

The on-line event reconstruction in ALICE is performed by the High Level Trigger, which should process up to 2000 events per second in proton-proton collisions and up to 300 central events per second in heavy-ion collisions, corresponding... more

More Data Locality for Static Control Programs on NUMA Architectures

by Claude Tadonki

2024, HAL (Le Centre pour la Communication Scientifique Directe)

The polyhedral model is powerful for analyzing and transforming static control programs, hence its intensive use for the optimization of data locality and automatic parallelization. Affine transformations excel at modeling control flow,... more

Performance Analysis of Parallel Programs with RAPIDS as a Framework of Execution Easychair

by Seyi T O P E Ogunji

2024, EPiC Series in Computing Volume 104, 2024, Pages 243–267 Proceedings of 3rd International Workshop on Mathematical Modeling and Scientific Computing

In this age where data is growing at an astronomical rate, with unfettered access to digital information, complexities have been introduced to scientific computations, analysis, and inferences. This is because such data could not be easily processed with traditional approaches. However, with innovative designs brought to the fore by NVIDIA and other market players in recent times, there have been productions of state-of-the-art GPUs such as NVIDIA A100 Tensor Core GPU, Tesla V100, and NVIDIA H100 that seamlessly handle complex mathematical simulations and computations, artificial intelligence, machine learning, and high-performance computing, producing highly improved speed and efficiency, with room for scalability. These innovations have made it possible to efficiently deploy many parallel programming models like shared memory, distributed memory, data parallelization, and Partitioned Global Address Space (PGAS) with high-performance metrics. In this work, we analyzed the parquet-formatted New York City yellow taxi dataset on a RAPIDS and DASK supported distributed data-parallel training platform using a high-performance cluster of 7 NVIDIA TITAN RTX GPUs (24GB GDDR6 each) running CUDA 12.4. The dataset was used to train Extreme Gradient Boosting (XGBoost), RandomForest Regressor, and Elastic Net models for trip fare predictions. Our models achieved notable performance metrics. The XGBoost achieved a mean squared error (MSE) of 10.87, R² of 96.9%, and a training time of 21.1 seconds despite the huge size of the training dataset, showing how computationally efficient the system was. RandomForest achieved MSE of 27.46, R² of 92.2% and a training time of 25.9 seconds. In the bid to show the scalability and versatility of our experimental design to different machine learning domains, our multi-GPU accelerated training was extended to image classification tasks by using MobileNet-V3-Large pre-trained architecture on a CIFAR-100 dataset. The following parallelization results were achieved: a low Karp-Flatt metric of 0.013, indicating minimal serialization, 98.7% parallel fraction, demonstrating excellent parallelization, and only 7.1% communication overhead relative to computation time. For the model performance, we achieved a ROC AUC of over 95% for the implementation. This work advances the state-of-the-art in parallel computing through implementation of RAPIDS and DASK frameworks on a distributed data-parallel training platform making use of NVIDIA multi-GPUs. The work is built on a well established theoretical framework using Amdahl's and Gustafson’s laws on parallel computation. By integrating RAPIDS and DASK, we contribute to advancing parallel computing capabilities, offering potential applications in smart city development and the field of logistics and transportation management services where rapid fare predictions are very important. The contribution could also be extended to the field of image classification, vision systems, object detection and embedded systems for mobile applications.
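
The relationship between the Karp-Flatt metric and the parallel fraction quoted in this abstract can be made concrete with a small sketch. The Karp-Flatt metric is the experimentally determined serial fraction e = (1/S − 1/p) / (1 − 1/p) for a measured speedup S on p processors, and Amdahl's law gives the inverse relation S = 1 / (e + (1 − e)/p). The function names below are hypothetical, and using p = 7 for the image-classification run is an assumption based on the cluster size stated above; the abstract does not say how many GPUs that particular run used.

```python
def karp_flatt(speedup, processors):
    """Experimentally determined serial fraction from a measured speedup."""
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


if __name__ == "__main__":
    e = 0.013  # Karp-Flatt metric reported in the abstract
    p = 7      # assumption: all 7 GPUs of the cluster were used
    s = amdahl_speedup(e, p)
    print("parallel fraction:", 1.0 - e)
    print("implied speedup on", p, "GPUs:", s)
    print("round trip back to e:", karp_flatt(s, p))
```

Under that assumption the implied speedup is roughly 6.5 out of an ideal 7, which matches the abstract's description of minimal serialization; the reported 98.7% parallel fraction is simply 1 − 0.013.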

Traversing large graphs on GPUs with unified memory

by Hyesoon Kim

2024, Proceedings of the VLDB Endowment

Due to the limited capacity of GPU memory, the majority of prior work on graph applications on GPUs has been restricted to graphs of modest sizes that fit in memory. Recent hardware and software advances make it possible to address much... more

Energy-dissipation splitting finite-difference time-domain method for Maxwell equations with perfectly matched layers

by Linghua Kong

2024, Journal of Computational Physics

In this paper, we develop a novel kind of energy-dissipation splitting finite-difference time-domain scheme for solving two-dimensional Maxwell equations with perfectly matched layers. The discrete energy dissipation law, convergence,...

Reverse Engineering Digital Forensics

by Rodrigo Lopes

2024

Engineering is often described as making practical application of the knowledge of pure sciences in the solution of a problem or the application of scientific and mathematical principles to develop economical solutions to technical...

Digital filter bank implementation in hydroacoustic monitoring tasks

by Slava Prestigh

2024, PRZEGLĄD ELEKTROTECHNICZNY

The paper discusses digital filter bank implementation using the weighted overlap-add (WOLA) algorithm modification introduced in the context of multichannel signal processing. The suggested modification is applied to hydroacoustic... more

Realtime Dense Stereo Matching with Dynamic Programming in CUDA

by Oscar Ruiz-Salguero

2024, CEIG

Real-time depth extraction from stereo images is an important process in computer vision. This paper proposes a new implementation of the dynamic programming algorithm to calculate dense depth maps using the CUDA architecture achieving... more

Procesamiento de señales SAR en GPGPU

by Javier Areta

2024

This work aims to present the main features of the design and implementation of parallel algorithms for processing synthetic aperture radar (SAR) signals. It analyzes the reasons why the...
