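Programming Massively Parallel Processors (English edition, 2nd edition; co-written by leading experts, an authoritative work on parallel programming) - [US] Kirk - China Machine Press - Hong Kong Book City - Meg Book Store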


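[Simplified Chinese edition] Programming Massively Parallel Processors (English edition, 2nd edition; co-written by leading experts, an authoritative work on parallel programming)

Store code: 2048308
Category: Simplified Chinese books → Mainland China books → Computers/Networking → Programming
Author: [US] Kirk
ISBN: 9787111416296
Publisher: China Machine Press
Publication date: 2013-03-01
Edition: 1 Printing: 1
Pages/Word count: 496/
Format: 16mo Binding: paperback

Price: HK$ 209.4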


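Description:

Classic Original Editions Library: Programming Massively Parallel Processors (English edition, 2nd edition). Drawing on many years of teaching parallel computing courses, the authors explain the techniques needed to write parallel programs in a concise, intuitive, and practical way. Numerous case studies walk through the whole development process of a parallel program, from computational thinking through to an efficient, working implementation.

Compared with the previous edition, this edition has been thoroughly revised and updated and presents parallel programming more systematically. It introduces the basic parallel algorithm patterns, adds more background material, and covers several new practical programming techniques and tools. The specific updates are:
  Parallel patterns: three new chapters on parallel patterns describe in detail many of the algorithms used in parallel applications.
  CUDA Fortran: this chapter briefly introduces the programming interface for the CUDA architecture and illustrates CUDA programming with numerous examples.
  OpenACC: this chapter introduces the open standard that uses directives to express parallelism in order to simplify parallel programming.
  Thrust: Thrust is an abstraction layer on top of CUDA C/C++. This edition devotes a chapter to showing how the Thrust parallel template library can be used to build high-performance applications with minimal programming effort.
  C++ AMP: a programming interface developed by Microsoft to simplify massively parallel programming in Windows environments.
  NVIDIA's Kepler architecture: explores the programming features of NVIDIA's high-performance, energy-efficient GPU architecture.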

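About the authors:

  David B. Kirk is a member of the US National Academy of Engineering and an NVIDIA Fellow, and was formerly Chief Scientist of NVIDIA. He led the development of NVIDIA's graphics technology, helping make it today's most popular mass entertainment platform, and is one of the founders of CUDA technology. In 2002 he received the ACM SIGGRAPH Computer Graphics Achievement Award for his outstanding contributions to bringing high-performance computer graphics systems to the mass market. He holds bachelor's and master's degrees in mechanical engineering from the Massachusetts Institute of Technology and a PhD in computer science from the California Institute of Technology. Dr. Kirk is the inventor of 50 patents and patent applications relating to graphics chip design, has published more than 50 papers on graphics processing technology, and is an authority on visual computing.
  Wen-mei W. Hwu (胡文美) holds a PhD in computer science from the University of California, Berkeley. He is the Jerry Sanders (AMD founder) Endowed Chair Professor of Electrical and Computer Engineering in the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign (UIUC), co-director of the Universal Parallel Computing Research Center jointly funded by Microsoft and Intel, and principal investigator of the world's first NVIDIA CUDA Center of Excellence. Professor Hwu is a world-leading expert on parallel processor architecture and compilers and serves as principal investigator for Blue Waters, the next-generation US petascale computer. He is an IEEE Fellow and an ACM Fellow.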

Table of contents:

preface
acknowledgements
chapter 1 introduction
1.1 heterogeneous parallel computing
1.2 architecture of a modern gpu
1.3 why more speed or parallelism?
1.4 speeding up real applications
1.5 parallel programming languages and models
1.6 overarching goals
1.7 organization of the book
references
chapter 2 history of gpu computing
2.1 evolution of graphics pipelines
the era of fixed-function graphics pipelines
evolution of programmable real-time graphics
unified graphics and computing processors
2.2 gpgpu: an intermediate step
2.3 gpu computing
scalable gpus
recent developments
future trends
references and further reading
chapter 3 introduction to data parallelism and cuda c
3.1 data parallelism
3.2 cuda program structure
3.3 a vector addition kernel
3.4 device global memory and data transfer
3.5 kernel functions and threading
3.6 summary
function declarations
kernel launch
predefined variables
runtime api
3.7 exercises
references
chapter 4 data-parallel execution model
4.1 cuda thread organization
4.2 mapping threads to multidimensional data
4.3 matrix-matrix multiplication--a more complex kernel
4.4 synchronization and transparent scalability
4.5 assigning resources to blocks
4.6 querying device properties
4.7 thread scheduling and latency tolerance
4.8 summary
4.9 exercises
chapter 5 cuda memories
5.1 importance of memory access efficiency
5.2 cuda device memory types
5.3 a strategy for reducing global memory traffic
5.4 a tiled matrix-matrix multiplication kernel
5.5 memory as a limiting factor to parallelism
5.6 summary
5.7 exercises
chapter 6 performance considerations
6.1 warps and thread execution
6.2 global memory bandwidth
6.3 dynamic partitioning of execution resources
6.4 instruction mix and thread granularity
6.5 summary
6.6 exercises
references
chapter 7 floating-point considerations
7.1 floating-point format
normalized representation of m
excess encoding of e
7.2 representable numbers
7.3 special bit patterns and precision in ieee format
7.4 arithmetic accuracy and rounding
7.5 algorithm considerations
7.6 numerical stability
7.7 summary
7.8 exercises
references
chapter 8 parallel patterns: convolution
8.1 background
8.2 1d parallel convolution--a basic algorithm
8.3 constant memory and caching
8.4 tiled 1d convolution with halo elements
8.5 a simpler tiled 1d convolution--general caching
8.6 summary
8.7 exercises
chapter 9 parallel patterns: prefix sum
9.1 background
9.2 a simple parallel scan
9.3 work efficiency considerations
9.4 a work-efficient parallel scan
9.5 parallel scan for arbitrary-length inputs
9.6 summary
9.7 exercises
reference
chapter 10 parallel patterns: sparse matrix-vector multiplication
10.1 background
10.2 parallel spmv using csr
10.3 padding and transposition
10.4 using hybrid to control padding
10.5 sorting and partitioning for regularization
10.6 summary
10.7 exercises
references
chapter 11 application case study: advanced mri reconstruction
11.1 application background
11.2 iterative reconstruction
11.3 computing fhd
step 1: determine the kernel parallelism structure
step 2: getting around the memory bandwidth limitation
step 3: using hardware trigonometry functions
step 4: experimental performance tuning
11.4 final evaluation
11.5 exercises
references
chapter 12 application case study: molecular visualization and analysis
12.1 application background
12.2 a simple kernel implementation
12.3 thread granularity adjustment
12.4 memory coalescing
12.5 summary
12.6 exercises
references
chapter 13 parallel programming and computational thinking
13.1 goals of parallel computing
13.2 problem decomposition
13.3 algorithm selection
13.4 computational thinking
13.5 summary
13.6 exercises
references
chapter 14 an introduction to opencl™
14.1 background
14.2 data parallelism model
14.3 device architecture
14.4 kernel functions
14.5 device management and kernel launch
14.6 electrostatic potential map in opencl
14.7 summary
14.8 exercises
references
chapter 15 parallel programming with openacc
15.1 openacc versus cuda c
15.2 execution model
15.3 memory model
15.4 basic openacc programs
parallel construct
loop construct
kernels construct
data management
asynchronous computation and data transfer
15.5 future directions of openacc
15.6 exercises
chapter 16 thrust: a productivity-oriented library for cuda
16.1 background
16.2 motivation
16.3 basic thrust features
iterators and memory space
interoperability
16.4 generic programming
16.5 benefits of abstraction
16.6 programmer productivity
robustness
real world performance
16.7 best practices
fusion
structure of arrays
implicit ranges
16.8 exercises
references
chapter 17 cuda fortran
17.1 cuda fortran and cuda c differences
17.2 a first cuda fortran program
17.3 multidimensional array in cuda fortran
17.4 overloading host/device routines with generic interfaces
17.5 calling cuda c via iso_c_binding
17.6 kernel loop directives and reduction operations
17.7 dynamic shared memory
17.8 asynchronous data transfers
17.9 compilation and profiling
17.10 calling thrust from cuda fortran
17.11 exercises
chapter 18 an introduction to c++ amp
18.1 core c++ amp features
18.2 details of the c++ amp execution model
explicit and implicit data copies
asynchronous operation
section summary
18.3 managing accelerators
18.4 tiled execution
18.5 c++ amp graphics features
18.6 summary
18.7 exercises
chapter 19 programming a heterogeneous computing cluster
19.1 background
19.2 a running example
19.3 mpi basics
19.4 mpi point-to-point communication types
19.5 overlapping computation and communication
19.6 mpi collective communication
19.7 summary
19.8 exercises
reference
chapter 20 cuda dynamic parallelism
20.1 background
20.2 dynamic parallelism overview
20.3 important details
launch environment configuration
api errors and launch failures
events
streams
synchronization scope
20.4 memory visibility
global memory
zero-copy memory
constant memory
texture memory
20.5 a simple example
20.6 runtime limitations
memory footprint
nesting depth
memory allocation and lifetime
ecc errors
streams
events
launch pool
20.7 a more complex example
linear bezier curves
quadratic bezier curves
bezier curve calculation predynamic parallelism
bezier curve calculation with dynamic parallelism
20.8 summary
reference
chapter 21 conclusion and future outlook
21.1 goals revisited
21.2 memory model evolution
21.3 kernel execution control evolution
21.4 core performance
21.5 programming environment
21.6 future outlook
references
appendix A: matrix multiplication host-only version source code
appendix B: gpu compute capabilities
index
