高速計算ライブラリ - flatlineの日記跡地

id:flappphys:20050430#p1 の続き．

ATLAS

ATLAS: Automatically Tuned Linear Algebra Software http://math-atlas.sourceforge.net/ は，ひとことで言えばBLASのよくできた実装だ．BLASの要る応用ソフトウェアを走らせるにはこいつがあればとりあえずどうにかなる（微妙に細部が違ったりするけど）．最近ではこちらがほぼメインになってきたので，始めからこちらに言及している応用ソフトウェアも多い．
BLASの実装を，計算機の性能を十分引き出す，効率の良いものにするには，数値計算と計算機アーキテクチャ両方の知識を持った技術者を多数動員してコーディングを行わせる必要がある．しかも例えばIntel XeonとAMD Opteronとでは，それぞれの特性に合わせて異なる実装をする必要があるので，工数はますます上がる．要するにやってられない．そこを自動化しようとする研究プロジェクト（とその産物たるライブラリ）がATLASだ．
ATLASにおいては，ある計算を行うアルゴリズムは複数通り用意されている．そのうちでコンパイル時に最も高速に動作するものが（サンプルプログラムの動作を基に）自動的に選別される仕組みだ*1．レジスタの活用もさることながら，主にキャッシュの活用に重点を置いている．ATLASはそうやって自動生成されたCソースからコンパイルされる．現在のところ，BLASのAPI全てと，下で述べるLAPACKのAPI数個が実装されている．ただしBLAS Level 1では自動生成テクニックは使われていない．FortranコードからATLASを呼び出すこともできる．詳しいことは作者のレポートを参照してほしい： Whaley, Petitet, and Dongarra, Automated Empirical Optimization of Software and the ATLAS Project, http://www.netlib.org/lapack/lawns/lawn147.ps （今どきPostScriptかよ...）

LAPACK

こいつが本命（？）． LAPACK: Linear Algebra PACKage http://phase.hpcc.jp/mirrors/netlib/lapack/

LAPACK is written in Fortran77 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.
If you're uncertain of the LAPACK routine name to address your application's needs, check out the LAPACK Search Engine.
The original goal of the LAPACK project was to make the widely used EISPACK and LINPACK libraries run efficiently on shared-memory vector and parallel processors. On these machines, LINPACK and EISPACK are inefficient because their memory access patterns disregard the multi-layered memory hierarchies of the machines, thereby spending too much time moving data instead of doing useful floating-point operations. LAPACK addresses this problem by reorganizing the algorithms to use block matrix operations, such as matrix multiplication, in the innermost loops. These block operations can be optimized for each architecture to account for the memory hierarchy, and so provide a transportable way to achieve high efficiency on diverse modern machines. We use the term "transportable" instead of "portable" because, for fastest possible performance, LAPACK requires that highly optimized block matrix operations be already implemented on each machine.
LAPACK routines are written so that as much as possible of the computation is performed by calls to the Basic Linear Algebra Subprograms (BLAS). While LINPACK and EISPACK are based on the vector operation kernels of the Level 1 BLAS, LAPACK was designed at the outset to exploit the Level 3 BLAS ---a set of specifications for Fortran subprograms that do various types of matrix multiplication and the solution of triangular systems with multiple right-hand sides. Because of the coarse granularity of the Level 3 BLAS operations, their use promotes high efficiency on many high-performance computers, particularly if specially coded implementations are provided by the manufacturer.
Highly efficient machine-specific implementations of the BLAS are available for many modern high-performance computers. For details of known vendor- or ISV-provided BLAS, consult the BLAS FAQ. Alternatively, the user can download ATLAS to automatically generate an optimized BLAS library for the architecture. A Fortran77 reference implementation of the BLAS in available from netlib; however, its use is discouraged as it will not perform as well as a specially tuned implementation.

（訳）
LAPACKはFortran77で書かれており，線形方程式系の解や最小二乗解を求めたり，固有値問題，特異値問題を解くためのルーチンを提供します．それに関連する行列の分解手法 (LU, Cholesky, QR, SVD, Schur, generalized Schur) に加え， Schur分解の並べ替えや条件数の見積もり等の計算も提供されています．密行列と帯行列は扱えますが，一般の疎行列はうまく扱えません．全てにおいて，実行列と複素行列に対して同様の機能が，単精度と倍精度の両方で提供されています．
アプリケーションで使いたいLAPACKルーチンの名前が分からないなら， LAPACKサーチエンジンをどうぞ．
LAPACKプロジェクトの元々の目標は，広く使われていたライブラリ，EISPACKやLINPACKを，共有メモリを持つベクトルまたは並列プロセッサ上で効率良く動かすことでした． LINPACKやEISPACKでは，メモリアクセスのパターンがマシンのメモリ階層を考慮しておらず，意味のある浮動小数点演算ではなくデータの移送に時間がかかり過ぎていたので，そういったマシン上では効率が悪かったのです． LAPACKは行列の乗法などのアルゴリズムを再編成し，最内ループでブロック行列に基づく処理を行うことでこの問題を解決しました．これらのブロック処理は各アーキテクチャ毎にメモリ階層を考慮して最適化でき，それにより多岐に渡る最新型マシンにおいて高い効率を実現するための「運搬可能な」方法を提供できます．「可搬性のある」でなく「運搬可能な」という言葉遣いをしたのは，可能なかぎり高いパフォーマンスのためには，高度に最適化されたブロック行列処理が各マシンで予め実装されている必要があるからです．
LAPACKルーチンは，計算のできる限り多くが Basic Linear Algebra Subprograms (BLAS) を呼び出すことでなされるように設計されています． LINPACKとEISPACKはBLAS Level 1のベクトル計算カーネルに基づいていますが， LAPACKは始めからBLAS Level 3を活用するように設計されました． ---それは様々な行列の乗算を行ったり，複数の定数ベクトルに対する三角行列系を解くための Fortranサブプログラムの仕様一式です． BLAS Level 3の処理の粒度は荒いので，それらを使うことで多くの高速コンピュータにおいて，特に製造メーカから特別な実装が得られる場合に，高い効率を得られます．
とても効率の良い，マシン毎のBLASの実装が，多くの新しい高性能コンピュータに対して入手可能です．計算機ヴェンダもしくはサードパーティの提供するBLASの詳細については BLAS FAQを参照してください． BLASの代わりに，ATLASをダウンロードし，アーキテクチャに対して最適化されたBLASライブラリを自動生成させることもできます． Fortran77によるBLASの参照実装がnetlibから手に入りますが，特別にチューニングした実装ほどには効率が出ないので，なるべく使わない方がよいでしょう．

サブルーチンの名前がどうのという話が出てきたが，LAPACKルーチンの名前はまさに "good old Fortran" といった感じで，素人には一体何なのかさっぱり分からない*2．DPTSVXでDouble precisionのPositive definiteかつTridiagonalな行列を係数に持つ線形方程式系をSolVeするeXpert向けルーチンを表す... アヒャ！
なお，LAPACKサイトトップに描かれた行列はマニアックな笑いを提供してくれる... かもしれない．

*1:ただしサンプルの動作計測も注意しないと誤った結論を導きがちだし，いくつもの選択肢で全ての組合せを試していたらいつまでたってもコンパイルが終らなくなるので，その辺をきちんと解決する工夫がされている．

*2:これはFortran 77が識別子にアルファベット6文字までしか許容していなかったせい．大文字小文字の区別のないアルファベットを表すために6bit，それが6文字で36bit．それが1wordだったマシンが多かったのだろう．