先写一句题外话,我是玩社交网络的,例如豆瓣、LinkedIn、42区,俺只是不玩新浪微博而已。
LinkedIn的多核与并行计算讨论组(Multicore & Parallel Computing Group)里,刚刚发生了一次讨论,内容很有价值,尤其是对刚踏入这个领域的新人而言。所以转载到BLOG上。
一开始,是印度小姑娘Ratika Agarwal问了个基本问题:
How to measure the performance of any working algorithm? do any body know? |
Darryl Gove提到了各种经典工具,例如Linux下的Oprofile、Solaris下的DTrace和MDB、AMD的Code Analyst,还有Intel的VTune:
If, by performance, you mean “How long does it take”, then you could use the time command, print out the number of units of work completed and divide one by the other. However, that doesn’t guaranty that your implementation of the algorithm is efficient. So I’d recommend profiling. On Linux look at Oprofile, on Solaris (or Linux) you can use the performance analyzer from Solaris Studio (just announced the beta programme for 12.3), on Windows with AMD chips they have a Code Analyst, Intel you probably want to look at VTune. Most of these tools will give you time based profiles and hardware performance counter data. So you can look into things like the instruction count, cache misses etc. |
John Lilley又提到了Visual Studio Team edition里面的profiler工具。他总结了影响性能的主要因素,优化的时候往往是在各个因素之间寻求平衡。
Also on Windows, Visual Studio Team edition 2008 and 2010 has a built-in code profiler for native code. I think you might get managed code profiling in the Professional versions; however I am unsure about that. However, there are many dimensions to consider when talking about “performance” in addition to CPU. For example, you may want any of the following to optimize: 1) Memory usage 2) Disk usage 3) Wall-clock time (as opposed to CPU loading, which is what a profiler will give you) There are many examples of such tradeoffs. For example consider the problem of sorting a file of records. The fastest sort algorithm will probably load the entire file into memory, perform a parallel sort that utilizes all CPU cores as much as possible, and will probably make a complete copy of the data (or perhaps just a copy of pointers to the data), and then write the output file. However, there are many tradeoffs to consider. What if the file is 100GB and you only have 8GB of memory? Then you must store some intermediate results on disk. You might then consider the CPU/disk tradeoff of compressing the temporary data. These are just some suggestions. |
Jim Dempsey进一步提醒,优化更多是一项工程活儿,程序性能往往受到具体环境影响:如果数据集中在内存里,那么提高缓存的本地性往往比减少需执行的指令数更重要;而如果面临I/O密集型的应用,选择适当的负载均衡技术是最重要的。
The performance of a given algorithm is not necessarily an absolute property. In reading the prior posts you will find that there are various factors that can affect performance and therefore one algorithm is not always superior to another without considering the runtime circumstances. When your working dataset fits within memory then paying attention to improving cache locality may be more important than reducing instruction count. When I/O is involved then choosing a technique that permits computation overlapped with I/O usually is superior. Overlapping I/O or sequencing the data that is fed into an algorithm is generally not considered part of “the algorithm” but may have more of an impact on application performance than the algorithm itself. |
John Lilley顶贴,对此表示同意。
Jim is absolutely right. The “performance” of an algorithm must be judged within the context of the program’s requirements. The kind of performance you want out of a program on a multi-CPU server with virtual memory and a fast SAN might be completely different than what you want from a similar algorithm running on a smart phone. On the server, you might optimize for performance and scalability, On the smartphone, you might optimize for power conservation, memory, and storage use. |
Ratika Agarwal表示压力很大,自己是个新手,不知道从哪里开始。
But I’m a beginner.I’m just going to start implementing the comparative analysis of algorithms.actually I need to do a project for my M.Tech thesis. Thus I am thinking about, to construct some parallel algorithms of the existing sequential ones. Is that the right path what i am following, for my M.Tech project and thesis? Also I am a very new user for parallel programming. Don’t know from where to start. |
于是John Lilley给她推荐了书和其他学习资料。
This book: http://books.google.com/books/about/The_Art_of_Concurrency.html?id=rU68SYVS7S8C is a good primer to parallel programming, if you have access. Also read Dmitry’s blog: http://www.1024cores.net. A goodle search on “parallel programming blog” yields good forums in the top 10. Depending on your programming language of choice, I would use “Thread Building Blocks” from Intel for C++, or Cilk++, or if you are C#/.Net, use PLINQ and TPL. On the Mac, there is Grand Central Dispatch. I’m not sure what Java people use. The maintream programming wisdom today is to think in terms of sub-tasks to execute in parallel, not in terms of threads. The aforementioned tools are all parallel task models. This is a concise illustration of parallelizing quicksort using cilk++: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/cref_cls/common/cilk_convertcpp.htm. There are many interesting problems out there that that benefit from efficient parallel implementations, especially in the realm of graph analysis, sorting, searching, indexing, and so on. |
Jeff Kenton又给了些建议。
Your original question was about ” how to measure the performance”. So, what you need to know is how much faster your parallel algorithm is than the sequential one? And how well does it scale with the addition of more cores / threads? The bottlenecks will likely turn out to be communication between the pieces, and sequential parts that you can’t eliminate. You will need to choose a suitable problem to solve. Implement it sequentially first so that you understand the simple version. Then decide what can be done in parallel and how to combine and communicate the results at the end. And also figure out how you will measure your results and how you will instrument the various parts so that you understand what the issues are. Good luck. |
Christopher Aziz提到了LINPACK指标和openMP。
”performance measurement” is an important, frequently misrepresented and often misunderstood topic usually associated with “timing”. Notionally, “timing” a race is simple. You click the stop watch when the starter’s gun fires and then stop it when the first runner or horse crosses the finish line. This is the lay person’s view: ” Where is the hard part?”. Certainly not in this simplistic notion. Now a 6 dimensional race with variable non-continuous environmental conditions and a difficult to calculate “finish line” boundary condition might be closer to the practical reality but is far beyond the common conception of “timing”. The hard part is knowing what to measure and developing a process model that let’s you confirm the sensitivity, variability, repeatability and ability to extrapolate or predict results in other cases from your measurements. Shifting from abstract to practical, I’d suggest taking a look at the netlib.org/benchmark suite. My favorites are the LINPACK variants. Leave the HPL for last. To get a handle on factors affecting results start scalar (1 core) and try different compiler options, machines, size of problem effects and look at run to run variation. As you posted to this multicore group, I’d suggest next converting the codes to openMP. For timing consistency, I like the openMP elapsed timer, omp_get_wtime, regardless of how you may effect parallelization. It is portable and generally reliable. You are stepping into a huge space. It is a personal favorite. To get a “feel” for the subject area, I think your best first step is to pick a small corner and get deep with it rather than survey the broader domain. Good Hunting. |
Eugenio Villar进一步补充
In practice, the performance of an algorithm strongly depends on the run-time conditions on a certain processor. The problem becomes harder in multi-processing platforms. And even harder in heterogeneous multi-processing platforms. When your target is the same as the host on which you are developing the code, some of the techniques commented already can be used. If your target is different, then an ISS or a virtualization-based simulator (www.ovpworld.org) can be used. A more abstract technology not requiring the complete compilable code is native simulation (www.teisa.unican.es/scope). We are now working in keeping the accuracy even when the OS calls are removed… |
Rick Peralta耐心地逐句解答了Ratika Agarwal的困惑
Performance is relative some metric. There is no absolute performance. You can measure performance relative computations per second. You can measure the relative computations per second based on single or multi-threaded implementations. Of you could measure the ratio of computations to space. Or the quality of the computations (e.g. bits of resolution). > I’m a beginner.I’m just going to start implementing the comparative analysis of algorithms. The first step is to determine what you want to measure. Then look for ways to measure and represent it. Often performance is multi-dimensional and multi-modal, so there is not necessarily a single optimal solution. Assuming you are looking for simple computations per second, for various solutions, the wall clock method is a good first step. How long does it take to complete some fixed bit of work. Compile various solutions and see what the run time is. Run the same test for different implementations. On a finer resolution, a profiler can give comparable information for various parts of the code. > Don’t know from where to start. For your research, I’d suggest that you build a benchmark framework that monitors whatever you are looking at (wall clock time, CPU time, memory usage, external I/O, et cetera) and drop in the various solutions. The standard framework will help keep the comparisons comparable. Consider the traveling salesman problem. Is it better to do the whole job single threaded, perhaps using the stack to handle branches or to drop each branch to a thread thread or queue to a pool of threads? Code up each and see what the results are. Try to keep the test environment the same, as largely invisible characteristics (such as pipeline behavior or other activity on the system) can significantly impact the results. As a general rule, if things are not clear, simplify. Look at a simpler problem, use simpler tools. Too often people get lost in their own sophistication. - Rick Keep It Simple Silly ;^) |
Eduardo Grosclaude介绍了几本参考书
You may find Grama/Gupta/Karypis/Kumar’s book very enlightening.http://www-users.cs.umn.edu/~karypis/parbook/ |
Jonathan Beard又给了不少参考资料网址,他同意Rick的说法:“Keep It Simple”
Here’s a few books I’ve found useful that I used for class that are relatively cheap (although I find it best to check out library editions until i really find I can’t live without a book): The Art of Multiprocessor Programming – http://bit.ly/pd3fz9 Computer Architecture: A quantitative approach – http://bit.ly/o7PGoD , although if you don’t have much hardware experience I’d recommend the other book as well Computer Organization and Design. You mention creating parallel algorithms from sequential ones, are you planning on an automated process? If so I’d recommend a compiler book, Advanced Compiler Design (http://bit.ly/qK7wm2). There are tons of languages to implement parallel algorithms, the choice of which I think should be left to the algorithm and how the author can best express the idea…in other words, don’t get hung up on one language or the next. I agree with Rick on the keep it simple. I’d also go one step further and say turn off hyperthreading (not to knock it, but it’s easier to reason about performance without it), speed stepping, synchronize clocks across cores, and when possible tie threads/processes to a specific processor so you know exactly what data you are collecting. Nothing is more frustrating than performing multiple tests only to find out that part of your data must be thrown out because a context swap or frequency increase gave inconsistent results. Good luck. |
Houmad Tighezza引用了1978年的一本老书里的一段话
The question is simple but the answer good be very complex, the answer depend witch performance you want to know: numerical performance, fault tolerance, CPU, memory, vectorization/ parallelization, speed up, MPIS, MACS, etc.? In my opinion, the first book all about the performance evaluation techniques and they application was Wright by DOMENICO FERRARI in 1978 (PRENTICE-HALL). Computer Systems Performance Evaluation by Domenico Ferrari 1978 : “The author argues that systems performance evaluation, in the first twenty years of its existence, has developed in substantial isolation with respect to such disciplines as computer architecture, system organization, operating systems, and software engineering. The possible causes for this phenomenon, which seems to be unique in the history of engineering, are explored. Its positive and negative effects on computer science and technology, as well as on performance evaluation itself, are discussed. In the author’s opinion, the drawbacks of isolated development outweigh its advantages. Thus, the author proposes instructional and research initiatives to foster the rapid integration of the performance evaluation viewpoint into the main stream of computer science and engineering.” |
Antonio M Flores建议多查查Wiki,上面全是宝
Strictly speaking, algorithms are measured by an analisys of complexity, which is a computer science mathematical concept, so it is irrelevant to the hardware platform. I suggest that you take a deep reading to the article in Wikipedia for a first insight: http://en.wikipedia.org/wiki/Analysis_of_algorithms and http://en.wikipedia.org/wiki/Time_complexity. Of course, for the same complexity algorithms, you can later do profiling, which is a programming concept and for which there is plenty of software tools available: http://en.wikipedia.org/wiki/Profiling_(computer_programming), as well to do a measure of efficiency: http://en.wikipedia.org/wiki/Algorithmic_efficiency For parallel versus sequential algorithms, you should be using the concept of speed-up and related: http://en.wikipedia.org/wiki/Speed_up Good Luck and Success!! |
最后Ratika Agarwal表示感谢。又有一个叫Agnaldo Pereira的新手感谢Ratika Agarwal问了这个好问题,看来对他也帮助很大。
看过这么一段话:Computer theorists invent algorithms that solve important problems and analyze their asymptotic behavior (e.g. O(NlogN) or O(N3)). Computer architects set the constant factors for these algorithms. http://csl.stanford.edu/~christos/archdef.html 深表同意:)
呵呵,的确有道理。
Hey I know this is off topic but I was wondering if you knew of any widgets I could add to my
blog that automatically tweet my newest twitter updates.
I’ve been looking for a plug-in like this for quite some time and was hoping maybe you would have some experience with something like this. Please let me know if you run into anything. I truly enjoy reading your blog and I look forward to your new updates.