
Implement Hadoop Code Reuse With esProc

Hadoop's MapReduce is a widely used parallel computing framework. However, its code reuse mechanism is inconvenient, and passing parameters is quite cumbersome. Far from our usual experience of simply calling a library function, I found that both the coder and the caller must keep a sizable list of precautions in mind when writing even a short program meant to be called by others. We eventually found, however, that esProc makes code reuse in Hadoop easy.

Taking the simple and understandable example of grouping and summarizing again, let's first look at a solution that is not very reusable. Suppose we need to group the big order data (sales.txt) on HDFS by salesman (empID) and compute the total sales amount for each salesman. The esProc code is:

Code for summary machine:

Code for node machine:

esProc divides distributed computing code into two categories: code for the summary machine and code for the node machines. The summary machine is responsible for task scheduling, distributing the tasks to every node machine running in parallel, and finally integrating and summarizing the results computed by the node machines. Each node machine reads the segment of the whole data file specified by its parameters, then groups and summarizes the data in that segment.

As can be seen, esProc code is intuitive and straightforward and follows a natural way of thinking. The summary machine splits a task into several segments, distributes them to the node machines for an initial summary, and then summarizes the node results a second time.
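The two-stage flow described above can be sketched in plain Python. This is only an in-memory illustration, not esProc or Hadoop code: the function names `node_task` and `summary_task`, the row layout of `(empID, amount)` tuples, and the sample data are all assumptions made for the example, and the "distribution" is just a loop rather than real parallel dispatch.

```python
from collections import defaultdict

def node_task(rows, start, end):
    """Node-machine role: group one segment of rows by empID and sum amount."""
    partial = defaultdict(float)
    for emp_id, amount in rows[start:end]:
        partial[emp_id] += amount
    return dict(partial)

def summary_task(rows, task_number):
    """Summary-machine role: split the data, collect partial results, merge them."""
    # Ceiling division so that every row falls into some segment.
    size = max(1, -(-len(rows) // task_number))
    partials = [node_task(rows, start, start + size)
                for start in range(0, len(rows), size)]
    # Second summary pass: add up the per-segment subtotals.
    total = defaultdict(float)
    for partial in partials:
        for emp_id, amount in partial.items():
            total[emp_id] += amount
    return dict(total)

# Hypothetical sample data: (empID, amount) pairs.
sales = [(1, 100.0), (2, 50.0), (1, 25.0), (3, 10.0), (2, 5.0)]
result = summary_task(sales, 2)
```

Because the sum is associative, the merged result does not depend on how many segments the data is split into, which is what makes the split-then-summarize pattern safe.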
Another thing to note is the esProc grouping and summarizing function "groups", which groups the two-dimensional table A1 by empID and sums the values of the amount field. The result is renamed to the more readable totalAmount. The whole grouping and summarizing procedure is concise and intuitive:

A1.groups(empID;sum(amount): totalAmount)

In addition, the groups function can be applied not only to a small 2D table, but also to a 2D table that is too big to be held in memory. For example, the code above uses cursor mode.

But the example above has an obvious defect: the code is not very reusable. In the next step, we will rewrite it as a universal algorithm that is independent of any concrete business. It will be rewritten to control the code flow with parameters, so that it can summarize any data file. The task can be split into an arbitrary number of segments, and the computation nodes can be specified at will. The revised code is shown below:

Code for summary machine.

Four parameters are defined here: fileName, the big data file to analyze; taskNumber, the number of tasks to distribute; groupField, the field to group by; and sumField, the field to summarize. In addition, the list of node machines is obtained by reading the configuration profiles.

Code for node machine.

In the revised code, 4 variables are used to receive the parameters from the summary machine. Besides the starting and ending positions (start and end) from the first example, there are two newly added parameters: groupField, the field to group by, and sumField, the field to summarize.
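A rough Python analog of the parameterized version might look like the sketch below. The parameter names mirror the four described above (`file_name`, `task_number`, `group_field`, `sum_field`), but everything else is an assumption made for illustration: the file is taken to be delimited text with a header row, segments are cut by row index rather than by file position, and the tasks run sequentially in one process instead of on separate node machines.

```python
import csv
import os
import tempfile
from collections import defaultdict

def node_task(file_name, start, end, group_field, sum_field, sep=","):
    """Group rows [start, end) of the file by group_field and sum sum_field."""
    partial = defaultdict(float)
    with open(file_name, newline="") as f:
        for i, row in enumerate(csv.DictReader(f, delimiter=sep)):
            if start <= i < end:
                partial[row[group_field]] += float(row[sum_field])
    return dict(partial)

def summary_task(file_name, task_number, group_field, sum_field, sep=","):
    """Split the file into task_number row ranges and merge the partial sums."""
    with open(file_name, newline="") as f:
        row_count = sum(1 for _ in csv.DictReader(f, delimiter=sep))
    size = max(1, -(-row_count // task_number))  # ceiling division
    total = defaultdict(float)
    for start in range(0, row_count, size):
        part = node_task(file_name, start, start + size,
                         group_field, sum_field, sep)
        for key, value in part.items():
            total[key] += value
    return dict(total)

# Hypothetical demo file; the "region" column exists only to show reuse.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 newline="") as tmp:
    tmp.write("empID,region,amount\n1,east,100\n2,west,50\n"
              "1,east,25\n3,west,10\n")
    path = tmp.name

by_emp = summary_task(path, 2, "empID", "amount")
by_region = summary_task(path, 3, "region", "amount")
os.remove(path)
```

The same two functions serve both queries; only the arguments change. Re-reading the file in every task is wasteful and is done here only to keep the sketch short.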
With esProc, it is much easier to pass and use parameters, so users can implement common grouping and summarizing with the least workload and reuse the code easily.

In Hadoop, complicated business algorithms are mainly implemented by writing MapReduce classes. By comparison, passing and using parameters in MapReduce is much more inconvenient. Although it is possible to implement a flexible algorithm that is independent of the concrete business, doing so is really cumbersome. Judging from typical Hadoop code, the code and the business logic are tightly coupled. To pass parameters, a global-variable-like mechanism is required, which is not only inconvenient but also hard to understand. That is why so many questions about MapReduce parameter passing appear on so many Web pages, and why lots of people feel confused about developing universal algorithms with MapReduce.

In addition, the default separator in the code above is the comma. Users obviously only need to add a variable in a similar way to change it to any other commonly used symbol. In the same way, they can easily implement the common pattern of filtering the data first and then grouping and summarizing it.

Please note the usage of the parameter groupField. In cell A6 it is used as an ordinary character parameter, but in A8 it is used as a macro. In other words, ${groupField} is resolved as part of the formula itself, instead of as a parameter value inside the formula. This is a feature of a dynamic language. As a result, esProc code can be made completely flexible, for example by using a parameter to control whether the summary algorithm computes a sum, a count, an average, or a maximum.

"Macro" is a simple special case of dynamic language. esProc supports a more flexible and complete dynamic language system.
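The idea of letting a parameter choose the summary algorithm can be illustrated in Python as follows. Python has no text-substitution macro like `${groupField}`, so this sketch swaps in a different technique: a lookup table from aggregate names to functions. The `AGGREGATES` table, the `groups` helper, and the sample rows are all hypothetical and are not esProc's API.

```python
from collections import defaultdict

# Hypothetical table mapping a parameter value to an aggregate function.
AGGREGATES = {
    "sum": sum,
    "count": len,
    "avg": lambda values: sum(values) / len(values),
    "max": max,
}

def groups(rows, group_field, agg_name, agg_field):
    """Group rows by group_field and apply the named aggregate to agg_field."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group_field]].append(row[agg_field])
    agg = AGGREGATES[agg_name]
    return {key: agg(values) for key, values in buckets.items()}

# Hypothetical sample rows.
orders = [
    {"empID": 1, "amount": 100.0},
    {"empID": 2, "amount": 50.0},
    {"empID": 1, "amount": 25.0},
]
totals = groups(orders, "empID", "sum", "amount")
largest = groups(orders, "empID", "max", "amount")
averages = groups(orders, "empID", "avg", "amount")
```

Note that an average computed per segment cannot simply be averaged again in a second summary pass; a distributed version would need to carry sums and counts separately and divide at the end.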
As you can see from the example above, esProc can implement Hadoop code reuse easily and basically achieves the goal of "Write once, run anywhere!". Needless to say, development efficiency can be boosted dramatically.

More information at personal blog: http://datakeyword.blogspot.com/ website: http://www.raqsoft.com/

#data file #implement hadoop #implement hadoop code #code reuse #code flow #big data #hadoop code reuse