Tag Archives: linpack

怎样测量算法的性能?

  先写一句题外话,我是玩社交网络的,例如豆瓣LinkedIn42区,俺只是不玩新浪微博而已。

  LinkedIn的多核与并行计算讨论组(Multicore & Parallel Computing Group)里,刚刚发生了一次讨论,内容很有价值,尤其是对刚踏入这个领域的新人而言。所以转载到BLOG上。

  一开始,是印度小姑娘Ratika Agarwal问了个基本问题:

 

  How to measure the performance of any working algorithm? do any body know?

 

  Darryl Gove提到了各种经典工具,例如Linux下的Oprofile、Solaris下的DTrace和MDB、AMD的Code Analyst,还有Intel的VTune:

 

  If, by performance, you mean “How long does it take”, then you could use the time command, print out the number of units of work completed and divide one by the other. However, that doesn’t guaranty that your implementation of the algorithm is efficient.

  So I’d recommend profiling. On Linux look at Oprofile, on Solaris (or Linux) you can use the performance analyzer from Solaris Studio (just announced the beta programme for 12.3), on Windows with AMD chips they have a Code Analyst, Intel you probably want to look at VTune.

  Most of these tools will give you time based profiles and hardware performance counter data. So you can look into things like the instruction count, cache misses etc.

 

  John Lilley又提到了Visual Studio Team edition里面的profiler工具。他总结了影响性能的主要因素,优化的时候往往是在各个因素之间寻求平衡。

 

  Also on Windows, Visual Studio Team edition 2008 and 2010 has a built-in code profiler for native code. I think you might get managed code profiling in the Professional versions; however I am unsure about that. However, there are many dimensions to consider when talking about “performance” in addition to CPU. For example, you may want any of the following to optimize:

  1) Memory usage

  2) Disk usage

  3) Wall-clock time (as opposed to CPU loading, which is what a profiler will give you)

  There are many examples of such tradeoffs. For example consider the problem of sorting a file of records. The fastest sort algorithm will probably load the entire file into memory, perform a parallel sort that utilizes all CPU cores as much as possible, and will probably make a complete copy of the data (or perhaps just a copy of pointers to the data), and then write the output file. However, there are many tradeoffs to consider. What if the file is 100GB and you only have 8GB of memory? Then you must store some intermediate results on disk. You might then consider the CPU/disk tradeoff of compressing the temporary data. These are just some suggestions.

 

  Jim Dempsey进一步提醒,优化更多是一项工程活儿,程序性能往往受到具体环境影响:如果数据集中在内存里,那么提高缓存的本地性往往比减少需执行的指令数更重要;而如果面临I/O密集型的应用,选择适当的负载均衡技术是最重要的。

 

  The performance of a given algorithm is not necessarily an absolute property. In reading the prior posts you will find that there are various factors that can affect performance and therefore one algorithm is not always superior to another without considering the runtime circumstances. When your working dataset fits within memory then paying attention to improving cache locality may be more important than reducing instruction count. When I/O is involved then choosing a technique that permits computation overlapped with I/O usually is superior. Overlapping I/O or sequencing the data that is fed into an algorithm is generally not considered part of “the algorithm” but may have more of an impact on application performance than the algorithm itself.

 

  John Lilley顶贴,对此表示同意。

 

  Jim is absolutely right. The “performance” of an algorithm must be judged within the context of the program’s requirements. The kind of performance you want out of a program on a multi-CPU server with virtual memory and a fast SAN might be completely different than what you want from a similar algorithm running on a smart phone. On the server, you might optimize for performance and scalability, On the smartphone, you might optimize for power conservation, memory, and storage use.

 

  Ratika Agarwal表示压力很大,自己是个新手,不知道从哪里开始。

 

  But I’m a beginner.I’m just going to start implementing the comparative analysis of algorithms.actually I need to do a project for my M.Tech thesis.

  Thus I am thinking about, to construct some parallel algorithms of the existing sequential ones. Is that the right path what i am following, for my M.Tech project and thesis?

  Also I am a very new user for parallel programming. Don’t know from where to start.

 

  于是John Lilley给她推荐了书和其他学习资料。

 

  This book: http://books.google.com/books/about/The_Art_of_Concurrency.html?id=rU68SYVS7S8C is a good primer to parallel programming, if you have access. Also read Dmitry’s blog: http://www.1024cores.net. A goodle search on “parallel programming blog” yields good forums in the top 10. Depending on your programming language of choice, I would use “Thread Building Blocks” from Intel for C++, or Cilk++, or if you are C#/.Net, use PLINQ and TPL. On the Mac, there is Grand Central Dispatch. I’m not sure what Java people use. The maintream programming wisdom today is to think in terms of sub-tasks to execute in parallel, not in terms of threads. The aforementioned tools are all parallel task models. This is a concise illustration of parallelizing quicksort using cilk++: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/cref_cls/common/cilk_convertcpp.htm. There are many interesting problems out there that that benefit from efficient parallel implementations, especially in the realm of graph analysis, sorting, searching, indexing, and so on.

 

  Jeff Kenton又给了些建议。

 

  Your original question was about ” how to measure the performance”. So, what you need to know is how much faster your parallel algorithm is than the sequential one? And how well does it scale with the addition of more cores / threads? The bottlenecks will likely turn out to be communication between the pieces, and sequential parts that you can’t eliminate.

  You will need to choose a suitable problem to solve. Implement it sequentially first so that you understand the simple version. Then decide what can be done in parallel and how to combine and communicate the results at the end. And also figure out how you will measure your results and how you will instrument the various parts so that you understand what the issues are.

  Good luck.

 

  Christopher Aziz提到了LINPACK指标和openMP。

 

  ”performance measurement” is an important, frequently misrepresented and often misunderstood topic usually associated with “timing”. Notionally, “timing” a race is simple. You click the stop watch when the starter’s gun fires and then stop it when the first runner or horse crosses the finish line. This is the lay person’s view: ” Where is the hard part?”. Certainly not in this simplistic notion. Now a 6 dimensional race with variable non-continuous environmental conditions and a difficult to calculate “finish line” boundary condition might be closer to the practical reality but is far beyond the common conception of “timing”.

  The hard part is knowing what to measure and developing a process model that let’s you confirm the sensitivity, variability, repeatability and ability to extrapolate or predict results in other cases from your measurements.

  Shifting from abstract to practical, I’d suggest taking a look at the netlib.org/benchmark suite. My favorites are the LINPACK variants. Leave the HPL for last. To get a handle on factors affecting results start scalar (1 core) and try different compiler options, machines, size of problem effects and look at run to run variation. As you posted to this multicore group, I’d suggest next converting the codes to openMP.

  For timing consistency, I like the openMP elapsed timer, omp_get_wtime, regardless of how you may effect parallelization. It is portable and generally reliable.

  You are stepping into a huge space. It is a personal favorite. To get a “feel” for the subject area, I think your best first step is to pick a small corner and get deep with it rather than survey the broader domain.

  Good Hunting.

 

  Eugenio Villar进一步补充

 

  In practice, the performance of an algorithm strongly depends on the run-time conditions on a certain processor. The problem becomes harder in multi-processing platforms. And even harder in heterogeneous multi-processing platforms. When your target is the same as the host on which you are developing the code, some of the techniques commented already can be used. If your target is different, then an ISS or a virtualization-based simulator (www.ovpworld.org) can be used. A more abstract technology not requiring the complete compilable code is native simulation (www.teisa.unican.es/scope). We are now working in keeping the accuracy even when the OS calls are removed…

 

  Rick Peralta耐心地逐句解答了Ratika Agarwal的困惑

 

  Performance is relative some metric. There is no absolute performance. You can measure performance relative computations per second. You can measure the relative computations per second based on single or multi-threaded implementations. Of you could measure the ratio of computations to space. Or the quality of the computations (e.g. bits of resolution).

> I’m a beginner.I’m just going to start implementing the comparative analysis of algorithms.

  The first step is to determine what you want to measure. Then look for ways to measure and represent it. Often performance is multi-dimensional and multi-modal, so there is not necessarily a single optimal solution.

  Assuming you are looking for simple computations per second, for various solutions, the wall clock method is a good first step. How long does it take to complete some fixed bit of work. Compile various solutions and see what the run time is. Run the same test for different implementations. On a finer resolution, a profiler can give comparable information for various parts of the code.

> Don’t know from where to start.

  For your research, I’d suggest that you build a benchmark framework that monitors whatever you are looking at (wall clock time, CPU time, memory usage, external I/O, et cetera) and drop in the various solutions. The standard framework will help keep the comparisons comparable.

  Consider the traveling salesman problem. Is it better to do the whole job single threaded, perhaps using the stack to handle branches or to drop each branch to a thread thread or queue to a pool of threads? Code up each and see what the results are. Try to keep the test environment the same, as largely invisible characteristics (such as pipeline behavior or other activity on the system) can significantly impact the results.

  As a general rule, if things are not clear, simplify. Look at a simpler problem, use simpler tools. Too often people get lost in their own sophistication.

  - Rick Keep It Simple Silly ;^)

 

  Eduardo Grosclaude介绍了几本参考书

 

  You may find Grama/Gupta/Karypis/Kumar’s book very enlightening.http://www-users.cs.umn.edu/~karypis/parbook/

 

  Jonathan Beard又给了不少参考资料网址,他同意Rick的说法:“Keep It Simple”

 

  Here’s a few books I’ve found useful that I used for class that are relatively cheap (although I find it best to check out library editions until i really find I can’t live without a book):

  The Art of Multiprocessor Programming – http://bit.ly/pd3fz9

  Computer Architecture: A quantitative approach – http://bit.ly/o7PGoD , although if you don’t have much hardware experience I’d recommend the other book as well Computer Organization and Design.

  You mention creating parallel algorithms from sequential ones, are you planning on an automated process? If so I’d recommend a compiler book, Advanced Compiler Design (http://bit.ly/qK7wm2).

  There are tons of languages to implement parallel algorithms, the choice of which I think should be left to the algorithm and how the author can best express the idea…in other words, don’t get hung up on one language or the next.

  I agree with Rick on the keep it simple. I’d also go one step further and say turn off hyperthreading (not to knock it, but it’s easier to reason about performance without it), speed stepping, synchronize clocks across cores, and when possible tie threads/processes to a specific processor so you know exactly what data you are collecting. Nothing is more frustrating than performing multiple tests only to find out that part of your data must be thrown out because a context swap or frequency increase gave inconsistent results.

  Good luck.

 

  Houmad Tighezza引用了1978年的一本老书里的一段话

 

  The question is simple but the answer good be very complex, the answer depend witch performance you want to know: numerical performance, fault tolerance, CPU, memory, vectorization/ parallelization, speed up, MPIS, MACS, etc.?

  In my opinion, the first book all about the performance evaluation techniques and they application was Wright by DOMENICO FERRARI in 1978 (PRENTICE-HALL).

  Computer Systems Performance Evaluation by Domenico Ferrari 1978 : “The author argues that systems performance evaluation, in the first twenty years of its existence, has developed in substantial isolation with respect to such disciplines as computer architecture, system organization, operating systems, and software engineering. The possible causes for this phenomenon, which seems to be unique in the history of engineering, are explored. Its positive and negative effects on computer science and technology, as well as on performance evaluation itself, are discussed. In the author’s opinion, the drawbacks of isolated development outweigh its advantages. Thus, the author proposes instructional and research initiatives to foster the rapid integration of the performance evaluation viewpoint into the main stream of computer science and engineering.”

 

  Antonio M Flores建议多查查Wiki,上面全是宝

 

  Strictly speaking, algorithms are measured by an analisys of complexity, which is a computer science mathematical concept, so it is irrelevant to the hardware platform. I suggest that you take a deep reading to the article in Wikipedia for a first insight: http://en.wikipedia.org/wiki/Analysis_of_algorithms and http://en.wikipedia.org/wiki/Time_complexity. Of course, for the same complexity algorithms, you can later do profiling, which is a programming concept and for which there is plenty of software tools available: http://en.wikipedia.org/wiki/Profiling_(computer_programming), as well to do a measure of efficiency: http://en.wikipedia.org/wiki/Algorithmic_efficiency

  For parallel versus sequential algorithms, you should be using the concept of speed-up and related: http://en.wikipedia.org/wiki/Speed_up

  Good Luck and Success!!

 

  最后Ratika Agarwal表示感谢。又有一个叫Agnaldo Pereira的新手感谢Ratika Agarwal问了这个好问题,看来对他也帮助很大。

LINPACK和WINE

  最近在忙哪吒提供在线服务的事。全都是配置集群节点SSH、NFS、MPI和编译安装各种库这种杂事。

  今天刚刚在集群上编译通了LINPACKATLAS两个数值计算的库。以前总在超级计算机排行榜的新闻里听到LINPACK的大名,这下有机会见识了。ATLAS编译时间好久,针对硬件做很多优化设置(Automatically Tuned)。本科学线性代数的时候总逃课去上网和编程,没学扎实。工作以后才发现,只要稍微涉及一点科研的事情,数值计算就缺不了。不懂这个的程序员,感觉工资都会少一截。连有关黑客电影都叫The Matrix。关于这个问题,参考孟岩2006年的经典BLOG《理解矩阵》

  移植工作的拦路虎之一是COM控件的问题。某质谱仪器公司的数据格式是私有的,必须用他们提供的控件去读。悲剧的是,他们只提供WIN32下的COM控件。上个月他们公司的软件部门负责人来中国访问,问他,答曰Win64很快会提供支持,但是Linux近期内不打算支持。悲剧呀。

  所以就在考虑如何在云平台里解决这个问题。一种方案是在用户的桌面导成公开的文本格式再上传,但是这样影响用户体验,尤其是对那些数据量很大的用户,数据导出(和校准过滤)本来就是他们使用云计算的可能原因之一;另外一种方案是在服务端放一台Windows,专门转换数据,或者更玄一点,用虚拟机伺候,以便维护和并行;最后一个方案是用WINE,这个能搞通,当然是最好的。

  还有若干软件的BUG等着解决。组里又都在忙着分析数据,四脚朝天,抓不到人手来帮我。哇哇哭。

  像上次说的,一旦干起来,就有大量实际问题要解决。奋力开路中。