Category Archives: 0和1

How Do You Measure the Performance of an Algorithm?

  First, an aside: I'm a social-network person. I use Douban, LinkedIn, and 42qu (42区), for example; I just stay off Sina Weibo.

  A discussion just took place in LinkedIn's Multicore & Parallel Computing Group, and it was quite valuable, especially for newcomers to the field. So I'm reposting it on the blog.

  It began with a basic question from Ratika Agarwal, a young woman from India:

 

  How to measure the performance of any working algorithm? Does anybody know?

 

  Darryl Gove brought up the classic tools: Oprofile on Linux, DTrace and MDB on Solaris, AMD's CodeAnalyst, and Intel's VTune:

 

  If, by performance, you mean “How long does it take”, then you could use the time command, print out the number of units of work completed and divide one by the other. However, that doesn’t guarantee that your implementation of the algorithm is efficient.

  So I’d recommend profiling. On Linux look at Oprofile, on Solaris (or Linux) you can use the performance analyzer from Solaris Studio (just announced the beta programme for 12.3), on Windows with AMD chips they have a Code Analyst, Intel you probably want to look at VTune.

  Most of these tools will give you time based profiles and hardware performance counter data. So you can look into things like the instruction count, cache misses etc.
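  Darryl's first suggestion, timing a fixed amount of work and dividing one number by the other, is easy to sketch. A minimal illustration (Python purely for concreteness; the `throughput` helper and the sorting workload are my own placeholders, not from the thread):

```python
import time

def throughput(fn, units_of_work):
    """Time a workload and report units of work completed per second."""
    start = time.perf_counter()
    for unit in units_of_work:
        fn(unit)
    elapsed = time.perf_counter() - start
    return len(units_of_work) / elapsed

# Example workload: sort many small reversed lists.
work = [list(range(200, 0, -1)) for _ in range(1000)]
rate = throughput(sorted, work)
print(f"{rate:.0f} sorts/second")
```

  As Darryl warns, a good number here still says nothing about whether the implementation itself is efficient; for that you need a profiler.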

 

  John Lilley then mentioned the profiler built into Visual Studio Team edition. He summarized the main factors that affect performance: optimizing is often a matter of trading them off against one another.

 

  Also on Windows, Visual Studio Team edition 2008 and 2010 has a built-in code profiler for native code. I think you might get managed-code profiling in the Professional versions; however, I am unsure about that. There are many dimensions to consider when talking about “performance” in addition to CPU. For example, you may want to optimize any of the following:

  1) Memory usage

  2) Disk usage

  3) Wall-clock time (as opposed to CPU loading, which is what a profiler will give you)

  There are many examples of such tradeoffs. For example consider the problem of sorting a file of records. The fastest sort algorithm will probably load the entire file into memory, perform a parallel sort that utilizes all CPU cores as much as possible, and will probably make a complete copy of the data (or perhaps just a copy of pointers to the data), and then write the output file. However, there are many tradeoffs to consider. What if the file is 100GB and you only have 8GB of memory? Then you must store some intermediate results on disk. You might then consider the CPU/disk tradeoff of compressing the temporary data. These are just some suggestions.
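  The 100GB-file-on-8GB-of-memory scenario John describes is the classic external merge sort: sort what fits in memory, spill the sorted runs to disk, then merge the runs. A toy sketch of that idea (a tiny chunk size just to force spilling; the function names are mine, not John's):

```python
import heapq
import os
import tempfile

def _spill(sorted_chunk):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(s + "\n" for s in sorted_chunk)
    return path

def external_sort(lines, chunk_size=4):
    """Sort an iterable of strings using bounded memory: sort fixed-size
    chunks, spill them to disk, then k-way merge the sorted runs."""
    run_files, chunk = [], []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:          # memory budget reached
            run_files.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        run_files.append(_spill(sorted(chunk)))
    streams = [open(f) for f in run_files]
    try:
        # heapq.merge is lazy: only one line per run lives in memory
        return [line.rstrip("\n") for line in heapq.merge(*streams)]
    finally:
        for s in streams:
            s.close()
        for f in run_files:
            os.remove(f)

print(external_sort(["pear", "apple", "plum", "fig", "kiwi"], chunk_size=2))
# → ['apple', 'fig', 'kiwi', 'pear', 'plum']
```

  The CPU/disk tradeoff John mentions (compressing the temporary runs) would slot into `_spill` and the merge step.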

 

  Jim Dempsey further cautioned that optimization is largely an engineering matter, and a program's performance often depends on its environment: if the dataset fits in memory, improving cache locality often matters more than cutting the instruction count; when I/O is involved, choosing a technique that overlaps computation with I/O matters most.

 

  The performance of a given algorithm is not necessarily an absolute property. In reading the prior posts you will find that there are various factors that can affect performance and therefore one algorithm is not always superior to another without considering the runtime circumstances. When your working dataset fits within memory then paying attention to improving cache locality may be more important than reducing instruction count. When I/O is involved then choosing a technique that permits computation overlapped with I/O usually is superior. Overlapping I/O or sequencing the data that is fed into an algorithm is generally not considered part of “the algorithm” but may have more of an impact on application performance than the algorithm itself.

 

  John Lilley followed up to agree.

 

  Jim is absolutely right. The “performance” of an algorithm must be judged within the context of the program’s requirements. The kind of performance you want out of a program on a multi-CPU server with virtual memory and a fast SAN might be completely different from what you want from a similar algorithm running on a smart phone. On the server, you might optimize for performance and scalability. On the smartphone, you might optimize for power conservation, memory, and storage use.

 

  Ratika Agarwal replied that she felt out of her depth: she was a beginner and didn't know where to start.

 

  But I’m a beginner. I’m just going to start implementing the comparative analysis of algorithms. Actually, I need to do a project for my M.Tech thesis.

  Thus I am thinking about constructing parallel algorithms from the existing sequential ones. Is that the right path that I am following, for my M.Tech project and thesis?

  Also, I am very new to parallel programming and don't know where to start.

 

  So John Lilley recommended a book and other study materials.

 

  This book: http://books.google.com/books/about/The_Art_of_Concurrency.html?id=rU68SYVS7S8C is a good primer to parallel programming, if you have access. Also read Dmitry’s blog: http://www.1024cores.net. A Google search on “parallel programming blog” yields good forums in the top 10. Depending on your programming language of choice, I would use “Threading Building Blocks” from Intel for C++, or Cilk++, or if you are C#/.Net, use PLINQ and TPL. On the Mac, there is Grand Central Dispatch. I’m not sure what Java people use. The mainstream programming wisdom today is to think in terms of sub-tasks to execute in parallel, not in terms of threads. The aforementioned tools are all parallel task models. This is a concise illustration of parallelizing quicksort using Cilk++: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/cref_cls/common/cilk_convertcpp.htm. There are many interesting problems out there that benefit from efficient parallel implementations, especially in the realm of graph analysis, sorting, searching, indexing, and so on.

 

  Jeff Kenton added some advice.

 

  Your original question was about “how to measure the performance”. So, what you need to know is: how much faster is your parallel algorithm than the sequential one? And how well does it scale with the addition of more cores / threads? The bottlenecks will likely turn out to be communication between the pieces, and sequential parts that you can’t eliminate.

  You will need to choose a suitable problem to solve. Implement it sequentially first so that you understand the simple version. Then decide what can be done in parallel and how to combine and communicate the results at the end. And also figure out how you will measure your results and how you will instrument the various parts so that you understand what the issues are.

  Good luck.
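  Jeff's two questions, how much faster and how well does it scale, can be answered with a small measurement harness. A sketch using Python's multiprocessing as the parallel substrate (the prime-counting workload and the chunking scheme are placeholder assumptions of mine, not Jeff's):

```python
import time
from multiprocessing import Pool

def count_primes(rng):
    """CPU-bound placeholder workload: count primes in [lo, hi)."""
    lo, hi = rng
    return sum(n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
               for n in range(lo, hi))

def measure(n=20000, workers=4):
    """Time the sequential run, then the parallel run; report
    speedup = T_seq / T_par and efficiency = speedup / workers."""
    t0 = time.perf_counter()
    seq = count_primes((2, n))
    t_seq = time.perf_counter() - t0

    step = (n - 2) // workers
    chunks = [(2 + i * step, 2 + (i + 1) * step) for i in range(workers)]
    chunks[-1] = (chunks[-1][0], n)            # cover the remainder
    t0 = time.perf_counter()
    with Pool(workers) as pool:
        par = sum(pool.map(count_primes, chunks))
    t_par = time.perf_counter() - t0

    assert seq == par                          # same answer either way
    s = t_seq / t_par
    return s, s / workers

if __name__ == "__main__":
    s, e = measure()
    print(f"speedup {s:.2f}x, efficiency {e:.0%}")
```

  Rerunning `measure` with 1, 2, 4, 8 workers traces out the scaling curve Jeff describes; the gap from linear is the communication and residual sequential parts.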

 

  Christopher Aziz brought up the LINPACK benchmark and OpenMP.

 

  “Performance measurement” is an important, frequently misrepresented and often misunderstood topic usually associated with “timing”. Notionally, “timing” a race is simple: you click the stopwatch when the starter’s gun fires and then stop it when the first runner or horse crosses the finish line. This is the lay person’s view: “Where is the hard part?” Certainly not in this simplistic notion. Now, a 6-dimensional race with variable non-continuous environmental conditions and a hard-to-calculate “finish line” boundary condition might be closer to the practical reality, but that is far beyond the common conception of “timing”.

  The hard part is knowing what to measure and developing a process model that lets you confirm the sensitivity, variability, repeatability and the ability to extrapolate or predict results in other cases from your measurements.

  Shifting from abstract to practical, I’d suggest taking a look at the netlib.org/benchmark suite. My favorites are the LINPACK variants. Leave the HPL for last. To get a handle on factors affecting results, start scalar (1 core) and try different compiler options, machines, problem-size effects, and look at run-to-run variation. As you posted to this multicore group, I’d suggest next converting the codes to OpenMP.

  For timing consistency, I like the OpenMP elapsed timer, omp_get_wtime, regardless of how you may effect parallelization. It is portable and generally reliable.

  You are stepping into a huge space. It is a personal favorite. To get a “feel” for the subject area, I think your best first step is to pick a small corner and get deep with it rather than survey the broader domain.

  Good Hunting.

 

  Eugenio Villar added more:

 

  In practice, the performance of an algorithm strongly depends on the run-time conditions on a certain processor. The problem becomes harder on multi-processing platforms, and even harder on heterogeneous multi-processing platforms. When your target is the same as the host on which you are developing the code, some of the techniques already discussed can be used. If your target is different, then an ISS or a virtualization-based simulator (www.ovpworld.org) can be used. A more abstract technology not requiring the complete compilable code is native simulation (www.teisa.unican.es/scope). We are now working on keeping the accuracy even when the OS calls are removed…

 

  Rick Peralta patiently answered Ratika Agarwal's questions line by line.

 

  Performance is relative to some metric. There is no absolute performance. You can measure performance as computations per second. You can compare computations per second between single- and multi-threaded implementations. Or you could measure the ratio of computations to space. Or the quality of the computations (e.g. bits of resolution).

> I’m a beginner. I’m just going to start implementing the comparative analysis of algorithms.

  The first step is to determine what you want to measure. Then look for ways to measure and represent it. Often performance is multi-dimensional and multi-modal, so there is not necessarily a single optimal solution.

  Assuming you are looking for simple computations per second, for various solutions, the wall clock method is a good first step. How long does it take to complete some fixed bit of work. Compile various solutions and see what the run time is. Run the same test for different implementations. On a finer resolution, a profiler can give comparable information for various parts of the code.

> I don’t know where to start.

  For your research, I’d suggest that you build a benchmark framework that monitors whatever you are looking at (wall clock time, CPU time, memory usage, external I/O, et cetera) and drop in the various solutions. The standard framework will help keep the comparisons comparable.

  Consider the traveling salesman problem. Is it better to do the whole job single-threaded, perhaps using the stack to handle branches, or to drop each branch onto its own thread, or to queue branches to a pool of threads? Code up each and see what the results are. Try to keep the test environment the same, as largely invisible characteristics (such as pipeline behavior or other activity on the system) can significantly impact the results.

  As a general rule, if things are not clear, simplify. Look at a simpler problem, use simpler tools. Too often people get lost in their own sophistication.

  - Rick Keep It Simple Silly ;^)
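  Rick's benchmark-framework idea, one fixed harness that every candidate solution drops into, might look like this in Python (the metric choices and the best-of-N policy are my assumptions, not Rick's):

```python
import time
import tracemalloc

def benchmark(implementations, payload, repeat=3):
    """Run each candidate on the same payload under the same measurements
    (wall-clock time and peak Python heap) so the comparison stays fair."""
    results = {}
    for name, fn in implementations.items():
        best_time, peak_mem = float("inf"), 0
        for _ in range(repeat):                # best-of-N damps timer noise
            tracemalloc.start()
            start = time.perf_counter()
            fn(payload)
            elapsed = time.perf_counter() - start
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            best_time = min(best_time, elapsed)
            peak_mem = max(peak_mem, peak)
        results[name] = (best_time, peak_mem)
    return results

report = benchmark(
    {"sorted_copy": lambda d: sorted(d),
     "sort_in_place": lambda d: list(d).sort()},
    list(range(50000, 0, -1)),
)
for name, (t, mem) in report.items():
    print(f"{name:14s} {t * 1e3:8.2f} ms   peak {mem / 1e6:.2f} MB")
```

  Dropping a new implementation in means adding one entry to the dictionary; the harness, payload, and metrics stay fixed, which is exactly what keeps the comparisons comparable.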

 

  Eduardo Grosclaude recommended a reference book.

 

  You may find Grama/Gupta/Karypis/Kumar’s book very enlightening. http://www-users.cs.umn.edu/~karypis/parbook/

 

  Jonathan Beard supplied more reference URLs, and agreed with Rick: “Keep It Simple.”

 

  Here’s a few books I’ve found useful that I used for class that are relatively cheap (although I find it best to check out library editions until I really find I can’t live without a book):

  The Art of Multiprocessor Programming – http://bit.ly/pd3fz9

  Computer Architecture: A Quantitative Approach – http://bit.ly/o7PGoD , although if you don’t have much hardware experience I’d recommend the other book as well, Computer Organization and Design.

  You mention creating parallel algorithms from sequential ones, are you planning on an automated process? If so I’d recommend a compiler book, Advanced Compiler Design (http://bit.ly/qK7wm2).

  There are tons of languages to implement parallel algorithms, the choice of which I think should be left to the algorithm and how the author can best express the idea…in other words, don’t get hung up on one language or the next.

  I agree with Rick on the keep it simple. I’d also go one step further and say turn off hyperthreading (not to knock it, but it’s easier to reason about performance without it), speed stepping, synchronize clocks across cores, and when possible tie threads/processes to a specific processor so you know exactly what data you are collecting. Nothing is more frustrating than performing multiple tests only to find out that part of your data must be thrown out because a context swap or frequency increase gave inconsistent results.

  Good luck.

 

  Houmad Tighezza quoted a passage from a book written in 1978.

 

  The question is simple but the answer could be very complex; the answer depends on which performance you want to know: numerical performance, fault tolerance, CPU, memory, vectorization/parallelization, speedup, MIPS, MACS, etc.

  In my opinion, the first book all about performance evaluation techniques and their application was written by Domenico Ferrari in 1978 (Prentice-Hall).

  Computer Systems Performance Evaluation by Domenico Ferrari, 1978: “The author argues that systems performance evaluation, in the first twenty years of its existence, has developed in substantial isolation with respect to such disciplines as computer architecture, system organization, operating systems, and software engineering. The possible causes for this phenomenon, which seems to be unique in the history of engineering, are explored. Its positive and negative effects on computer science and technology, as well as on performance evaluation itself, are discussed. In the author’s opinion, the drawbacks of isolated development outweigh its advantages. Thus, the author proposes instructional and research initiatives to foster the rapid integration of the performance evaluation viewpoint into the main stream of computer science and engineering.”

 

  Antonio M Flores suggested consulting the Wikipedia articles; they are full of treasures.

 

  Strictly speaking, algorithms are measured by an analysis of complexity, which is a mathematical computer-science concept, so it is independent of the hardware platform. I suggest that you read the Wikipedia articles closely for a first insight: http://en.wikipedia.org/wiki/Analysis_of_algorithms and http://en.wikipedia.org/wiki/Time_complexity. Of course, for algorithms of the same complexity you can later do profiling, which is a programming concept and for which there is plenty of software available: http://en.wikipedia.org/wiki/Profiling_(computer_programming), as well as measure efficiency: http://en.wikipedia.org/wiki/Algorithmic_efficiency

  For parallel versus sequential algorithms, you should use the concept of speed-up and related measures: http://en.wikipedia.org/wiki/Speed_up

  Good Luck and Success!!
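  The speed-up Antonio points to has a crisp definition: S(p) = T1 / Tp, with efficiency S(p)/p, and Amdahl's law bounding S(p) when part of the work stays serial. A few lines make the relationships concrete:

```python
def speedup(t_sequential, t_parallel):
    """S = T1 / Tp: how many times faster the parallel run is."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, workers):
    """Speedup per worker; 1.0 means perfect linear scaling."""
    return speedup(t_sequential, t_parallel) / workers

def amdahl_bound(serial_fraction, workers):
    """Amdahl's law: the best speedup achievable when `serial_fraction`
    of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# Even a 5% serial part caps 8 workers well below 8x:
print(round(amdahl_bound(0.05, 8), 2))   # 5.93
```

  For a thesis comparing sequential and parallel versions, these three numbers, reported per core count, are usually the headline result.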

 

  Finally, Ratika Agarwal expressed her thanks. Another newcomer, Agnaldo Pereira, then thanked Ratika Agarwal for asking such a good question; apparently the thread helped him a great deal too.

The Incompleteness Theorem and Startups

  A busy holiday, and now I'm back. These days I bought Liu Weipeng's 《暗时间》 (Dark Time), Haruki Murakami's 《地下》 (Underground), and Huang Tieying's 《海底捞你学不会》 (Haidilao: You Can't Learn It). The first two are strongly recommended.

    

  Lately, founders can't open their mouths without mentioning Paul Graham and the Y Combinator fund. The name Y Combinator comes from the fixed-point combinator of the lambda calculus; see Liu Weipeng's essay 《康托尔、哥德尔、图灵——永恒的金色对角线》 (Cantor, Gödel, Turing: The Eternal Golden Diagonal) to learn more. In short, Gödel's proof of the incompleteness theorem means no single logical system can explain the entire universe. That has had a major impact not only on mathematics but on philosophy and art as well.
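  The fixed-point combinator behind the fund's name can even be written out and run. A sketch of my own in Python (strictly the eta-expanded Z-combinator form, since Python evaluates eagerly; this is not from Liu Weipeng's essay):

```python
# Y (strictly, the Z combinator): builds recursion with no self-reference.
Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# A "recursive" factorial that never mentions its own name:
fact = Y(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))
print(fact(5))   # 120
```

  The point of the exercise: recursion, seemingly a primitive feature, falls out of nothing but function application, which is the flavor of result the essay celebrates.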

  On the last day of the holiday I attended 42qu.com's “42区 . 技术 . 创业” (42qu · Technology · Startups) event. There were two speakers: xlvector, founder of Resys, and Luc Lan, an investor at PE.VC.

  I heard xlvector speak two years ago, back when he was still in academia and focused on the recommendation algorithms of the Netflix Prize competition. This time, having joined industry, he had much more to say about practical deployment, architecture, and the commercial impact of the product. I'm really looking forward to his new book.

  Luc Lan's talk was also substantial. For programmers who want to start a company, “shut up, just code” and produce a usable prototype really is the key step. Whether to seek angels or VCs depends on the situation. For example: if you can fund yourself through the early stage (for an Internet startup the seed money isn't much: a server, colocation, the necessary licenses... 张教主 has answered all the specifics) and even hold out until you have cash income, there is no need to give away expensive founders' equity. Or again, in some industries money is not the main bottleneck: the boss of Haidilao stresses that what limits their expansion is how fast they can train qualified store managers and shift leads, not capital. Only when money is the necessary catalyst, when you want to rapidly scale a business whose formula you have already mastered a hundredfold, does taking investment benefit both sides. One suggestion for some attendees: much of this information can be found or asked about online, e.g. which projects Luc Lan has invested in, how much he put into 张教主's company, how much 张教主 burns per month... Doing your homework is the precondition for a high-quality exchange.

Hiring Programmers, 《北京沉没》 (Beijing Sinking), and Go Language Performance

  First, a job posting: a friend of mine is doing a cloud-computing startup and hiring Flex/Erlang/Python programmers. The job is in Beijing; if you're interested, click through and take a look.

  The blog clearly slowed down in May and June. Some people asked whether I'd taken up microblogging; others asked whether I was preparing to change jobs. Neither: I've just been busy with both work and personal matters and got lazy, sorry. Also, as I get older I seem to lean toward longer-form writing and rarely bother to type up scattered thoughts (I remember Songshuhui translating a report that found blog-post length grows with the author's age; I couldn't find the link). I should relax about it: long when it wants to be long, short when short. As long as it gets written, there's value in it.

  Two days ago Beijing had heavy rain and flooding. A short story riffing on it, 《北京沉没》 (Beijing Sinking), promptly appeared on Douban and drew many recommendations; everyone is eagerly awaiting the next installment.

  Programming-language performance is a perennial flame war. Not long ago Robert Hundt of Google published a paper, Loop Recognition in C++/Java/Go/Scala, presenting a concise, compact performance benchmark comparing C++, Go, Java, and Scala. The loop-finding algorithm in the benchmark uses no advanced language features, only each language's basic infrastructure: standard containers, loops and branches, memory allocation, and so on. The paper's main conclusions: on a 32-bit system the C++ program is the fastest; under ordinary conditions the Java version takes 12.6 times as long as the C++ one, and the Go version 7 times. Within a single language, implementations vary enormously: a textbook “student-style” C++ implementation runs more than 8 times slower than one carefully tuned by an experienced engineer. Java's main bottleneck is its garbage collector; with SPEC-style tuning it speeds up at least 3x, and on a 64-bit system performance more than doubles compared with 32-bit. For more detail, read the paper.

  The paper was not exactly flattering to Google's own Go: it found Go slower than C++ and harder to develop in than Java. That got the Go team going. They just posted on the official blog pointing out that the Go version of the algorithm had plenty of optimization headroom left. Using the profiling tools shipped with the Go distribution, they improved the code until it ran more than 10 times faster and allocated 6 times less memory.

  The paper stirred up plenty of discussion in other language communities too. Language performance comparisons involve many factors (the programmers' experience and style, how well the benchmark fits each language's strengths, and so on). For real engineering, the choice of language should start from existing resources and experience, and consider mixed-language programming. My personal lesson from the pFind project: be conservative, and prefer the technology most of the team already knows. If a project lead chases novelty to pad his resume and treats the real work as a guinea pig for new technology, things will almost certainly go wrong.

The Linux Kernel Is About to Enter the 3.0 Era

  Yesterday the lovely Lyz got a kernel panic while building Linux, so this morning we rushed into the server room together to reboot. Every time I enter that room the roar sounds like a jet engine. The air conditioning is too cold; I sneeze.

  On May 30, before leaving for the LinuxCon conference in Japan, Linus tagged his kernel git tree 3.0-rc1. The long 2.6 era is finally coming to an end.

  The 2.4 kernel spanned 4 years and 24 releases; 2.6 spanned 6 years and 40 releases. These were the critical ten years in which the Linux kernel became the bedrock of Internet services. Progress on the PC desktop was modest, but in smartphones Android is wildly popular. I remember installing Red Hat on rainbow's machine back in '99, when Linux had only just shed its hacker mystique and won backing from commercial companies like IBM and Oracle, and many university teachers had never heard of it. Students could only touch the open-source world in grassroots tinkerer groups like Sapphire (蓝宝石) and AKA. Nowadays a computer-science student who has never used Linux is embarrassed to say hello.

LINPACK and WINE

  Lately I've been busy getting Nezha (哪吒) ready to provide an online service. It's all chores: configuring SSH, NFS, and MPI on the cluster nodes, and compiling and installing libraries.

  Today I finally got LINPACK and ATLAS, two numerical computing libraries, to build on the cluster. I'd always heard LINPACK's name in supercomputer-ranking news; now I've met it in person. ATLAS takes ages to compile because it does a lot of hardware-specific optimization (it's Automatically Tuned, after all). As an undergraduate I kept skipping linear algebra to go online and write code, so my foundations are shaky. Only after starting work did I discover that anything remotely connected to research needs numerical computing; a programmer who doesn't know it, I suspect, earns a notch less. Even the iconic hacker movie is called The Matrix. On this topic, see Meng Yan's classic 2006 blog post 《理解矩阵》 (Understanding the Matrix).

  One roadblock in the port is a COM component. A certain mass-spectrometer vendor's data format is proprietary and can only be read through the COM control they supply. Tragically, they only provide a Win32 COM control. When the head of their software division visited China last month, I asked him; he said Win64 support is coming soon, but Linux support is not planned for the near term. A tragedy.

  So I'm considering how to handle this in the cloud platform. One option is to have users export to an open text format on the desktop before uploading, but that hurts the user experience, especially for users with large datasets; data export (and calibration/filtering) is one of the very reasons they might use cloud computing. Another option is a dedicated Windows machine on the server side just for data conversion, or, fancier, a virtual machine, which is easier to maintain and parallelize. The last option is WINE; if that can be made to work, it would be best of all.

  There are still a number of software bugs waiting to be fixed. Meanwhile everyone in the group is flat-out busy analyzing data, and I can't grab anyone to help. Waah.

  As I said last time, once you actually get going, a flood of practical problems needs solving. Forging ahead.

Biological Data Processing and Distributed Parallel Computing

  Some shallow thoughts of mine on cloud computing for bioinformatics. First, “cloud computing” is a business-model concept in which SaaS looms large: network services replacing software products. On the other hand, from a technical angle (my personal view), a web service deserves the name “cloud computing” only if its backend involves large-scale distributed parallel infrastructure (a contestable definition, admittedly). This post covers the technical view; a discussion of users and services will follow.

  Technically, large-scale distributed parallel computing divides into compute-intensive and data-intensive workloads.

  Scientific computing in physics, geology, meteorology, and the like is typically compute-intensive: the CPU is the bottleneck and external I/O volume is comparatively modest. The traditional answer is “move the data to the CPUs.” Concretely, the industrial standard for programming is MPI, using its master/slave pattern to launch and schedule processes across cluster nodes; data is commonly shared over NFS, with hardware tricks underneath to raise I/O performance. Volunteer-computing projects like Folding@home are usually compute-intensive at heart too, though the software architecture differs.

  Web-scale problems such as search-engine information retrieval, by contrast, are data-intensive (also called I/O-intensive): CPU is no longer scarce, and the performance of shuttling huge datasets between memory and disk becomes the focus. MapReduce offers an effective answer here, the so-called “move the CPUs to the data.” Concretely, Hadoop is the most popular mainstream framework: index the corpus, replicate it across the cluster via a distributed file system, then schedule relatively lightweight processes to run the search and merge operations in stages.
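  The map / shuffle / reduce division of labor that Hadoop implements at datacenter scale fits in a few lines at toy scale. A word-count sketch (the canonical example; this is an illustration of the model, not Hadoop's API, and the function names are mine):

```python
from collections import defaultdict
from itertools import chain

def map_phase(shard):
    # map: emit a (key, 1) pair for every word in one input shard
    return [(word, 1) for word in shard.split()]

def shuffle(pairs):
    # shuffle: group values by key (what the framework does between phases)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: fold all grouped values for one key into a single result
    return key, sum(values)

def mapreduce(shards):
    mapped = chain.from_iterable(map_phase(s) for s in shards)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(mapreduce(["a b a", "b a"]))   # {'a': 3, 'b': 2}
```

  The point of the model: map and reduce calls are independent per shard and per key, so the framework can scatter them across thousands of machines next to the data, keeping only the small shuffled groups in flight.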

  But is there a class of workload whose input and output are both massive and whose processing is algorithmically heavy and CPU-hungry? There is: modern astronomy, for example, must search and filter terabytes or even petabytes of Hubble images... For such problems the MapReduce model does not necessarily fit, because the bottleneck shifts to transferring the massive data (in the web-search model, although the application is data-intensive, the submitted query and the returned results are tiny, whereas in many scientific applications the query itself is massive). Sector/Sphere's papers and technical reports conclude that one reason it clearly outperforms Hadoop is its purpose-built transport protocol, UDT, which is more efficient than the general-purpose TCP/IP Hadoop uses.

  For Hadoop specifically, being implemented in Java puts it at a performance disadvantage against C/C++, a short board that really hurts CPU-heavy applications (and another reason Sector/Sphere is faster). There are many remedies: Yahoo! developed MPI on Hadoop, with MPI doing the actual work and Hadoop wrapped around it for coarse scheduling and fault tolerance; Baidu built a Hadoop C++ Extension that replaces the key bottleneck components with C++; and the recently published roadmap for the next Hadoop version announces a thorough rewrite focused on performance.

  In bioinformatics, both genome sequencing and protein identification by mass spectrometry produce enormous volumes of data, and processing them devours CPU: a challenge for existing distributed computing models. In fact, distributed computing has always been dragged forward by applications: physicists invented cluster computing, and the information-retrieval industry forced MapReduce into existence. So biology's challenge may, in turn, hand computing an opportunity to grow.

Stand in Awe of Your Own Code

  I have always admired zf. When he joined the group five years ago, his programming was, no exaggeration, weak; I remember being the one to tell him to use delete[] to free an array, because plain delete would leak memory. But today, equally without exaggeration, every line he writes is a gem. BOSS H once said, “joyfire seems to mean that if zf were willing to join his team, he'd have the nerve to go to the moon.” Exactly; that is what I mean.

  Why? A quick tally: zf has committed about 170,000 lines in total (my counting method is crude; plenty of code was committed repeatedly across versions, and much of it is throwaway scripts rather than battle-tested industrial product). His pParse (and pCompare before it) has gone through roughly 39 iterations of refactoring and steady improvement. The group has plenty of programming prodigies whose SVN commits come in explosions (I often marvel that hchi can bang out my full week's workload in one afternoon). But zf is unique: his commit statistics are a steady flat line, every day, every week, every month... What is most frightening about the guy is that almost mechanical capacity for execution.

  And it shows beyond his code. I remember BOSS H once telling him that being "too introverted, afraid to ask questions, and poor at communicating would hold back his development." After that, I noticed that at every talk, whatever the venue, size, or topic, zf would raise his hand and put a few questions to the speaker. At first it looked forced; over time his personality actually changed. Now the guy shows up at an all-expat Halloween party dressed as Julius Caesar; who would dare say introversion is limiting him? BOSS H says he originally picked up, in a corridor, a dazed examinee who had just failed an interview with another advisor, and over all these years the student has educated the advisor.

  Many of the students who make it into our group have won gold, silver, bronze, or whatever medals at ACM and other contests. That doesn't mean they have ever built a genuinely useful large system or reached the standard of a competent software engineer. A couple of days ago I emailed the whole group, holding up zf as the example: be steady, stand in awe of your own code, and earn the ability to leave a lasting mark in the pFind codebase.

The “Nezha” System's First Milestone: Internal Alpha Release Tomorrow

  For the past two weeks zf and I have been pair programming, developing the “Nezha” (哪吒) system flat out.

  Unlike a web search engine, pFind has domain-specific requirements, above all the highest possible recall and precision. In everyday web search, most users rarely look past the first three to five pages of results. In our field, mass spectrometers are inherently precise instruments, and the scarce, weak-signal proteins in a dataset are often exactly what scientists and physicians care about most. The pFind engine has dozens of parameters; the recommended defaults have survived millions of spectra, but on any specific dataset there is always room to tune.

  In other words, when an experiment costs tens of thousands of yuan to produce precious data, you can't just run the pFind engine once with stock parameters and call it done; a great deal of deeper analysis is needed. For example: filtering and calibrating the spectra, in extreme cases using the data to uncover problems in the upstream experiment or instrument maintenance and going back to redo the experiment; or running a first trial search, statistically analyzing the identifications, tuning parameters and options, and searching again iteratively; or cross-validating with multiple search engines, even across fundamentally different methods (conventional search against protein sequence databases translated from genomes, the newer spectral-library search, and the most exploratory de novo algorithms)...

  In the annual iPRG international benchmark, people using the very same search engine place in the top ten or produce results that are a complete mess. A leading figure in proteomics put it in print: a great race car still needs an ace driver. The current bottleneck in the omics fields is data analysis, and the results depend heavily on the analyst's experience. People like rxsun, yfu, zf, hchi, and cliu are already well-known mass-spectrometry data analysts in this community.

  Deep analysis done by hand kills efficiency. As pFind advanced, we built collaborations with more and more biologists at home and abroad. But terabytes of data arriving every month and analysis reports due every week crushed everyone, and we had to turn away many collaborations and grants that came knocking.

  Fully automating the analysis pipeline is being pursued worldwide; Maxquant, currently red-hot, does it well. The pFind group has a good foundation: with yfu and his gang of exceptional algorithm researchers, we have our own strengths in spectrum search, de novo sequencing, modification discovery, and more. As for engineering, the desktop-grade C# code written by non-CS authors in biology labs is a very long way from scaling to large clusters, let alone offering online services through the cloud, and that is exactly where ICT's strength lies. On supercomputer parallel speedup, the scalability result in my paper last year reached 320 cores, which, to boast immodestly, beats every competitor's publicly reported results so far by an order of magnitude; in fact I have already reached 1,000 cores and am about to push for 2,000.

  Serving the outside world through a pFind cloud has always been my dream. I've worked toward it for five years, and this year it has finally begun. Perhaps it can become one of the building blocks of future biomedical data processing.

  Building a practical platform means integrating work from every direction and solving plenty of genuinely hard problems. Take parallel speedup alone: it is no longer as simple as accelerating the single pFind search step, because each stage of the pipeline has its own character. Where massive data streams through, as in whole-genome translation and indexing, the MapReduce toolbox applies; scoring and identification, which is 95% CPU, needs carefully designed MPI-style programs; other specialized algorithms, like the dynamic programming inside de novo sequencing, may call for GPU acceleration... And speedup is actually the easy part.

  Overtime is exhausting, yet writing the blog gets me excited again; this stretch of intense development has kept me offline, so there's a backlog of things to write. Tomorrow we show the prototype of Nezha (哪吒). The platform tries to chain together the group's main software, pFind, pBuild, pCluster, pMatch, pXtract, pParse, pNovo, and more, mining massive datasets as thoroughly as possible with no operator attention. We hope to hear everyone's comments.

The China Mozilla Drumbeat Festival and Volunteer Computing

  Volunteer computing uses the public's idle CPU time for scientific computation. I've called on blog readers before to join the Chinese Academy of Sciences' CAS@home project; the protein-structure-prediction algorithm running on that platform was developed by Prof. Dongbo Bu (卜东波) and others in our bioinformatics group at ICT.

  From March 23 to 25, the Institute of High Energy Physics, ICT, and Mozilla are jointly holding the China Mozilla Drumbeat festival. The expert-talk track is themed on volunteer computing, with many big names invited, among them David Anderson of UC Berkeley, leader of the famous search-for-extraterrestrials project SETI@home and creator of BOINC, today's best-known volunteer-computing platform.

  After that come two days of Hackfests (hands-on workshops), the fun part. Teams of 4-5 developers and 2-3 scientists will be formed, and under the guidance of domain experts you will build prototype code for a set of Hackfest challenge projects. Hackfest recruiting is under way; software talents, sign up now. Because the Hackfest tasks carry over between days, please try to commit to both full days. Volunteers get free drinks and lunch, plus the banquet on the 24th.

SD2.0 and the China Cocoa Mobile Developer Conference

  Thanks to Boss H's support, I just attended the Software Development 2.0 conference (SD2.0 2010) in Shanghai. Of the talks I heard, the ones I liked best were the ones with real substance.

  Most talks were polished (for example the lovely Xiqiao's talk on industrial design), and some were great fun (Taobao's Zhao Kun used e-commerce data to analyze the growth trend of Chinese women's bust sizes).

  On the flip side, there were annoying talks too, generally from the big vendors' (e.g. MS and IBM) hype squads. At a purely technical conference, sending managers or salespeople who haven't done frontline development in ages, or who lack basic technical sense, to blow smoke about the “cloud” usually backfires on the brand.

  Mobile development is gradually becoming a hot spot; even hchi is planning a smartphone version of pLabel. I ran into Tinyfool, who is busy organizing the China Cocoa Mobile Developer Conference on January 8. Comrades, use the chance to meet the iPhone development experts.

  I did get a lot out of the conference. The regret is the lack of hard-core systems content: no discussion of C++, D, or Go, and although the conference theme was the cloud, not a single Hadoop talk. There must be many teams in China building vertical or domain-specific search engines, like our pFind; I'd love more exchange with that circle.

  Finally: I have nothing whatsoever to do with Renmin Search. Please stop ribbing me about it.