
Implement Hadoop Code Reuse with esProc

Hadoop's MapReduce is a widely used parallel computing framework. However, its code reuse mechanism is inconvenient, and passing parameters is quite cumbersome. Unlike the usual experience of simply calling a library function, I found that both the author and the caller must keep a sizable number of precautions in mind when writing even a short piece of code for others to call. We eventually found that esProc makes code reuse on Hadoop easy.

Taking a simple and understandable example of grouping and summarizing, let's first look at a solution whose reusability is not so great. Suppose we need to group the big order data (sales.txt) on HDFS by salesman (empID) and compute the corresponding sales amount for each salesman. The esProc solution consists of two scripts: the code for the summary machine and the code for the node machine.

esProc divides a distributed computation into these two roles. The summary machine is responsible for task scheduling: it distributes the work to the node machines in the form of parameters, and finally integrates and summarizes the results returned by the node machines. Each node machine fetches the segment of the whole data file specified by its parameters, and then groups and summarizes the data in that segment.

As can be seen, esProc code is intuitive and straightforward, and follows a natural way of thinking. The summary machine splits a task into several segments, distributes them to the node machines for a first round of summarizing, and then summarizes the partial results a second time on the summary machine.

Another thing to note is the esProc grouping and summarizing function "groups", which groups the two-dimensional table A1 by empID and sums up the values of the amount field.
The result is renamed to the more readable totalAmount. The whole grouping and summarizing step is concise and intuitive:

    A1.groups(empID;sum(amount):totalAmount)

In addition, the groups function can be applied not only to a small 2D table, but also to a 2D table that is too large to be held in memory. For example, the code above works in cursor mode.

But the example above has an obvious defect: the code is not very reusable. In the following steps, we will rewrite it into a universal algorithm independent of any concrete business. The code flow will be controlled by parameters, so that any input file can be summarized. The task can be divided into an arbitrary number of segments, and the computing nodes can be specified at will. The revised code is described below.

Code for the summary machine. Four parameters are defined here:

    fileName: the big data file to analyze
    taskNumber: the number of tasks to distribute
    groupField: the field to group by
    sumField: the field to summarize

In addition, the list of node machines is obtained by reading the configuration file.

Code for the node machine. In the revised code, four variables are used to receive the parameters from the summary machine. Besides the starting and ending positions in the file (start and end) from the first example, there are two newly added ones: groupField, the field to group by, and sumField, the field to summarize.

In esProc, it is much easier to pass and use parameters: users can implement general-purpose grouping and summarizing with minimal modification, and reuse the code easily. In Hadoop, a complicated business algorithm is mainly implemented by writing MapReduce classes.
Compared with esProc, passing and using parameters in MapReduce is much more inconvenient. Although it is possible to implement a flexible algorithm independent of the concrete business, doing so is really cumbersome. Judging from typical Hadoop code, the coupling between code and business logic is tight. To pass parameters, a global-variable-like mechanism is required, which is not only inconvenient but also hard to understand. That is precisely why so many questions about MapReduce parameter passing appear on so many Web pages. Lots of people feel confused about developing universal algorithms with MapReduce.

In addition, the code above relies on the default field separator. Users only need to add one more variable in a similar way to customize it to any other commonly used symbol. In the same way, they can also easily implement the common pattern of filtering the data first and then grouping and summarizing.

Please note the usage of the parameter groupField. It is used as a string parameter in cell A6, but as a macro in A8. In other words, ${groupField} can be resolved as part of the formula itself, rather than only as a parameter value inside the formula. This is the work of a dynamic language. Accordingly, esProc can implement fully flexible code, for example using a parameter to control whether the summary algorithm performs a sum or just a count, or seeks the average or the maximum.

A "macro" is a simple special case of dynamic language. esProc supports a more flexible and complete dynamic language system.

As you may find from the example above, esProc can implement Hadoop code reuse easily, and basically achieves the goal of "Write once, run anywhere!". Needless to say, development efficiency can be boosted dramatically.
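To make the parameterized two-stage flow more concrete, here is a minimal sketch in Python rather than esProc, since the original esProc code grids are not shown here. It runs in a single process and splits an in-memory list by row index instead of splitting an HDFS file by byte position, so it only illustrates the idea. The function names (split_segments, node_task, summary_task), the AGGREGATES table and the agg parameter are all hypothetical; only fileName-style parameters such as taskNumber, groupField and sumField come from the text, and they appear here in Python naming style.

```python
from collections import defaultdict

# Hypothetical table: aggregation name -> (node-level function, merge function).
# "count" is merged by summing the partial counts from each node.
AGGREGATES = {
    "sum": (sum, sum),
    "count": (len, sum),
    "max": (max, max),
}


def split_segments(total, task_number):
    """Split the range [0, total) into task_number contiguous (start, end) pairs."""
    step, extra = divmod(total, task_number)
    bounds, start = [], 0
    for i in range(task_number):
        end = start + step + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def node_task(rows, start, end, group_field, sum_field, agg="sum"):
    """Node side (assumed role): group one segment and aggregate it."""
    node_fn, _ = AGGREGATES[agg]
    buckets = defaultdict(list)
    for row in rows[start:end]:
        buckets[row[group_field]].append(row[sum_field])
    return {key: node_fn(values) for key, values in buckets.items()}


def summary_task(rows, task_number, group_field, sum_field, agg="sum"):
    """Summary side (assumed role): distribute segments, then merge partial results."""
    _, merge_fn = AGGREGATES[agg]
    merged = defaultdict(list)
    for start, end in split_segments(len(rows), task_number):
        partial = node_task(rows, start, end, group_field, sum_field, agg)
        for key, value in partial.items():
            merged[key].append(value)
    return {key: merge_fn(values) for key, values in merged.items()}


# Made-up sample rows standing in for sales.txt.
sales = [
    {"empID": 1, "amount": 100.0},
    {"empID": 2, "amount": 50.0},
    {"empID": 1, "amount": 25.0},
    {"empID": 3, "amount": 10.0},
    {"empID": 2, "amount": 5.0},
]

totals = summary_task(sales, task_number=2, group_field="empID", sum_field="amount")
counts = summary_task(sales, 2, "empID", "amount", agg="count")
print(totals)
print(counts)
```

Because the grouping field, the summarized field and the aggregation are all plain parameters, the same two functions can be reused for any table with named fields. Selecting the aggregation by name is only a rough analogue of what the esProc macro does when it rewrites the formula itself.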

Personal blog: http://datakeyword.blogspot.com/ Website: http://www.raqsoft.com/
