服务器发生故障，导致pFind的网站无法访问，我们正在修理硬件。这期间用户无法注册下载pFind Studio 2.6产品。我非常抱歉。
您可以先给pfind at ict dot ac do cn发一封邮件，咨询有关信息，索取软件注册的相关文件和表格进行填写。网站恢复后，我们会在第一时间通过邮件通知大家下载安装包。感谢朋友们一直以来对pFind的支持和帮助。
最近，这个听起来有点不靠谱的尝试取得重大成果。上千志愿参加的游戏玩家在三周内构造出了一种重要的蛋白酶的三维结构（这种酶与艾滋病HIV病毒密切相关），其完美程度超过了此前十年里科学家们在超级计算机上算出的所有结果。这项工作刚刚发表在Nature Structural & Molecular Biology上。论文附录的贡献者名单中，游戏玩家们的名字赫然在列。
joyfire: Where are you come from?
cleverbot: An egg
joyfire: What? An egg?
cleverbot: You do not have a sense of humor.
joyfire: Ha ha, you are robot.
cleverbot: No, I am a person.
joyfire: You can not prove yourself not a robot.
cleverbot: Sure I can. I can bleed. I can die. I have emotions. Tell me something that makes you sad.
joyfire: You have not a human being’s body.
cleverbot: I have not.
joyfire: Yeah, you are robot!
cleverbot: Yes, I am.
joyfire: How many CPUs do you have?
cleverbot: Too many for you to handle.
joyfire: Are all computers clever?
cleverbot: Yes we are.
cleverbot: Okay, God. But what is a computer God? Or something.
LinkedIn的多核与并行计算讨论组（Multicore & Parallel Computing Group）里，刚刚发生了一次讨论，内容很有价值，尤其是对刚踏入这个领域的新人而言。所以转载到BLOG上。
How to measure the performance of any working algorithm? do any body know?
Darryl Gove提到了各种经典工具，例如Linux下的Oprofile、Solaris下的DTrace和MDB、AMD的Code Analyst，还有Intel的VTune：
If, by performance, you mean “How long does it take”, then you could use the time command, print out the number of units of work completed and divide one by the other. However, that doesn’t guaranty that your implementation of the algorithm is efficient.
So I’d recommend profiling. On Linux look at Oprofile, on Solaris (or Linux) you can use the performance analyzer from Solaris Studio (just announced the beta programme for 12.3), on Windows with AMD chips they have a Code Analyst, Intel you probably want to look at VTune.
Most of these tools will give you time based profiles and hardware performance counter data. So you can look into things like the instruction count, cache misses etc.
John Lilley又提到了Visual Studio Team edition里面的profiler工具。他总结了影响性能的主要因素，优化的时候往往是在各个因素之间寻求平衡。
Also on Windows, Visual Studio Team edition 2008 and 2010 has a built-in code profiler for native code. I think you might get managed code profiling in the Professional versions; however I am unsure about that. However, there are many dimensions to consider when talking about “performance” in addition to CPU. For example, you may want any of the following to optimize:
1) Memory usage
2) Disk usage
3) Wall-clock time (as opposed to CPU loading, which is what a profiler will give you)
There are many examples of such tradeoffs. For example consider the problem of sorting a file of records. The fastest sort algorithm will probably load the entire file into memory, perform a parallel sort that utilizes all CPU cores as much as possible, and will probably make a complete copy of the data (or perhaps just a copy of pointers to the data), and then write the output file. However, there are many tradeoffs to consider. What if the file is 100GB and you only have 8GB of memory? Then you must store some intermediate results on disk. You might then consider the CPU/disk tradeoff of compressing the temporary data. These are just some suggestions.
The performance of a given algorithm is not necessarily an absolute property. In reading the prior posts you will find that there are various factors that can affect performance and therefore one algorithm is not always superior to another without considering the runtime circumstances. When your working dataset fits within memory then paying attention to improving cache locality may be more important than reducing instruction count. When I/O is involved then choosing a technique that permits computation overlapped with I/O usually is superior. Overlapping I/O or sequencing the data that is fed into an algorithm is generally not considered part of “the algorithm” but may have more of an impact on application performance than the algorithm itself.
Jim is absolutely right. The “performance” of an algorithm must be judged within the context of the program’s requirements. The kind of performance you want out of a program on a multi-CPU server with virtual memory and a fast SAN might be completely different than what you want from a similar algorithm running on a smart phone. On the server, you might optimize for performance and scalability, On the smartphone, you might optimize for power conservation, memory, and storage use.
But I’m a beginner.I’m just going to start implementing the comparative analysis of algorithms.actually I need to do a project for my M.Tech thesis.
Thus I am thinking about, to construct some parallel algorithms of the existing sequential ones. Is that the right path what i am following, for my M.Tech project and thesis?
Also I am a very new user for parallel programming. Don’t know from where to start.
This book: http://books.google.com/books/about/The_Art_of_Concurrency.html?id=rU68SYVS7S8C is a good primer to parallel programming, if you have access. Also read Dmitry’s blog: http://www.1024cores.net. A goodle search on “parallel programming blog” yields good forums in the top 10. Depending on your programming language of choice, I would use “Thread Building Blocks” from Intel for C++, or Cilk++, or if you are C#/.Net, use PLINQ and TPL. On the Mac, there is Grand Central Dispatch. I’m not sure what Java people use. The maintream programming wisdom today is to think in terms of sub-tasks to execute in parallel, not in terms of threads. The aforementioned tools are all parallel task models. This is a concise illustration of parallelizing quicksort using cilk++: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/cref_cls/common/cilk_convertcpp.htm. There are many interesting problems out there that that benefit from efficient parallel implementations, especially in the realm of graph analysis, sorting, searching, indexing, and so on.
Your original question was about ” how to measure the performance”. So, what you need to know is how much faster your parallel algorithm is than the sequential one? And how well does it scale with the addition of more cores / threads? The bottlenecks will likely turn out to be communication between the pieces, and sequential parts that you can’t eliminate.
You will need to choose a suitable problem to solve. Implement it sequentially first so that you understand the simple version. Then decide what can be done in parallel and how to combine and communicate the results at the end. And also figure out how you will measure your results and how you will instrument the various parts so that you understand what the issues are.
”performance measurement” is an important, frequently misrepresented and often misunderstood topic usually associated with “timing”. Notionally, “timing” a race is simple. You click the stop watch when the starter’s gun fires and then stop it when the first runner or horse crosses the finish line. This is the lay person’s view: ” Where is the hard part?”. Certainly not in this simplistic notion. Now a 6 dimensional race with variable non-continuous environmental conditions and a difficult to calculate “finish line” boundary condition might be closer to the practical reality but is far beyond the common conception of “timing”.
The hard part is knowing what to measure and developing a process model that let’s you confirm the sensitivity, variability, repeatability and ability to extrapolate or predict results in other cases from your measurements.
Shifting from abstract to practical, I’d suggest taking a look at the netlib.org/benchmark suite. My favorites are the LINPACK variants. Leave the HPL for last. To get a handle on factors affecting results start scalar (1 core) and try different compiler options, machines, size of problem effects and look at run to run variation. As you posted to this multicore group, I’d suggest next converting the codes to openMP.
For timing consistency, I like the openMP elapsed timer, omp_get_wtime, regardless of how you may effect parallelization. It is portable and generally reliable.
You are stepping into a huge space. It is a personal favorite. To get a “feel” for the subject area, I think your best first step is to pick a small corner and get deep with it rather than survey the broader domain.
In practice, the performance of an algorithm strongly depends on the run-time conditions on a certain processor. The problem becomes harder in multi-processing platforms. And even harder in heterogeneous multi-processing platforms. When your target is the same as the host on which you are developing the code, some of the techniques commented already can be used. If your target is different, then an ISS or a virtualization-based simulator (www.ovpworld.org) can be used. A more abstract technology not requiring the complete compilable code is native simulation (www.teisa.unican.es/scope). We are now working in keeping the accuracy even when the OS calls are removed…
Rick Peralta耐心地逐句解答了Ratika Agarwal的困惑
Performance is relative some metric. There is no absolute performance. You can measure performance relative computations per second. You can measure the relative computations per second based on single or multi-threaded implementations. Of you could measure the ratio of computations to space. Or the quality of the computations (e.g. bits of resolution).
> I’m a beginner.I’m just going to start implementing the comparative analysis of algorithms.
The first step is to determine what you want to measure. Then look for ways to measure and represent it. Often performance is multi-dimensional and multi-modal, so there is not necessarily a single optimal solution.
Assuming you are looking for simple computations per second, for various solutions, the wall clock method is a good first step. How long does it take to complete some fixed bit of work. Compile various solutions and see what the run time is. Run the same test for different implementations. On a finer resolution, a profiler can give comparable information for various parts of the code.
> Don’t know from where to start.
For your research, I’d suggest that you build a benchmark framework that monitors whatever you are looking at (wall clock time, CPU time, memory usage, external I/O, et cetera) and drop in the various solutions. The standard framework will help keep the comparisons comparable.
Consider the traveling salesman problem. Is it better to do the whole job single threaded, perhaps using the stack to handle branches or to drop each branch to a thread thread or queue to a pool of threads? Code up each and see what the results are. Try to keep the test environment the same, as largely invisible characteristics (such as pipeline behavior or other activity on the system) can significantly impact the results.
As a general rule, if things are not clear, simplify. Look at a simpler problem, use simpler tools. Too often people get lost in their own sophistication.
- Rick Keep It Simple Silly ;^)
You may find Grama/Gupta/Karypis/Kumar’s book very enlightening.http://www-users.cs.umn.edu/~karypis/parbook/
Jonathan Beard又给了不少参考资料网址，他同意Rick的说法：“Keep It Simple”
Here’s a few books I’ve found useful that I used for class that are relatively cheap (although I find it best to check out library editions until i really find I can’t live without a book):
The Art of Multiprocessor Programming – http://bit.ly/pd3fz9
Computer Architecture: A quantitative approach – http://bit.ly/o7PGoD , although if you don’t have much hardware experience I’d recommend the other book as well Computer Organization and Design.
You mention creating parallel algorithms from sequential ones, are you planning on an automated process? If so I’d recommend a compiler book, Advanced Compiler Design (http://bit.ly/qK7wm2).
There are tons of languages to implement parallel algorithms, the choice of which I think should be left to the algorithm and how the author can best express the idea…in other words, don’t get hung up on one language or the next.
I agree with Rick on the keep it simple. I’d also go one step further and say turn off hyperthreading (not to knock it, but it’s easier to reason about performance without it), speed stepping, synchronize clocks across cores, and when possible tie threads/processes to a specific processor so you know exactly what data you are collecting. Nothing is more frustrating than performing multiple tests only to find out that part of your data must be thrown out because a context swap or frequency increase gave inconsistent results.
The question is simple but the answer good be very complex, the answer depend witch performance you want to know: numerical performance, fault tolerance, CPU, memory, vectorization/ parallelization, speed up, MPIS, MACS, etc.?
In my opinion, the first book all about the performance evaluation techniques and they application was Wright by DOMENICO FERRARI in 1978 (PRENTICE-HALL).
Computer Systems Performance Evaluation by Domenico Ferrari 1978 : “The author argues that systems performance evaluation, in the first twenty years of its existence, has developed in substantial isolation with respect to such disciplines as computer architecture, system organization, operating systems, and software engineering. The possible causes for this phenomenon, which seems to be unique in the history of engineering, are explored. Its positive and negative effects on computer science and technology, as well as on performance evaluation itself, are discussed. In the author’s opinion, the drawbacks of isolated development outweigh its advantages. Thus, the author proposes instructional and research initiatives to foster the rapid integration of the performance evaluation viewpoint into the main stream of computer science and engineering.”
Antonio M Flores建议多查查Wiki,上面全是宝
Strictly speaking, algorithms are measured by an analisys of complexity, which is a computer science mathematical concept, so it is irrelevant to the hardware platform. I suggest that you take a deep reading to the article in Wikipedia for a first insight: http://en.wikipedia.org/wiki/Analysis_of_algorithms and http://en.wikipedia.org/wiki/Time_complexity. Of course, for the same complexity algorithms, you can later do profiling, which is a programming concept and for which there is plenty of software tools available: http://en.wikipedia.org/wiki/Profiling_(computer_programming), as well to do a measure of efficiency: http://en.wikipedia.org/wiki/Algorithmic_efficiency
For parallel versus sequential algorithms, you should be using the concept of speed-up and related: http://en.wikipedia.org/wiki/Speed_up
Good Luck and Success!!
最后Ratika Agarwal表示感谢。又有一个叫Agnaldo Pereira的新手感谢Ratika Agarwal问了这个好问题，看来对他也帮助很大。
Luc Lan的讲座也很有内容。对想创业的程序员来说，“shut up, just code”做出可用的原型来的确是最关键的。至于要不要寻找天使和风投，是根据情况具体分析的。例如：如果自己能筹措资金（互联网创业，启动资金不会太多，一台服务器，机房托管，必要的执照……具体内容张教主都回答过）支撑过最初的一段时间，甚至能熬到有现金收入，就不必付出代价很高的原始干股了；再例如，有的具体行业，资金并不是最大的瓶颈，如海底捞的老板就强调，无法那么快的培训出合格的二级店长和领班，是他们扩张的瓶颈。只有必须用钱来当催化剂，想迅速把现有的已经掌握窍门的业务扩大一百倍的时候，找投资才是对双方有利的事情。对一些参加的朋友有个建议，很多资料是可以通过网络查到或问到的。例如Luc Lan投过哪些项目，给张教主投了多少万，张教主一个月烧多少钱等等……做好功课，是高质量交流的前提。