Counting Categories & Removing Duplicate Rows in Google OpenRefine
Using OpenRefine, how can I capture the facet counts which are so easily displayed in the facet windows? (Another way to state the problem: how can I group duplicate rows and count the frequency of duplicates?)
Open a project in Refine
Sort... the desired column (the column for which you want to copy the count of duplicate facets. There will be an option in the column dialog-box)
Sort > Reorder rows permanently: Using the Project Sort command that just appeared at the top of the project table (This is destructive in that it will resort the order of the how the data is stored So, you might want to operate on a copy of your data here)
Add a column using the facetCount expression: from-the-column-header dialog box > Edit column > Add column based on this column... / Expression = facetCount(value, ‘value’, ‘Zip Code’) where 'Zip Code' is the column name for which facets will be counted (i.e. the column name for which you chose the command "Add column based on column...")
Now deduplicate the rows:
From the original column header (in this case "Zip Code") > Edit cells > Blank down -- blank down means remove the duplicate entries in subsequent rows. This is possible because of the resorting from steps 2 and 3
Remove "blanks" (i.e. the duplicates). From the original column header > Facet > Customized facets > Facet by blank
Click the "true" facet to invoke limiting the those duplicate rows for which the duplicate cells have now been "blanked"
All > Edit rows > Remove all matching rows
Close the fact (e.g. "Zip Code" facet)
Optional, sort on the column header of the newly added facet count, in this case "Zip Count" Zip Count > Sort ... > numbers > largest first
Export desired output
This example is built on blogs and tutorials including http://davidhuynh.net/spaces/nicar2011/tutorial.pdf & http://googlerefine.blogspot.com/2011/08/remove-duplicate.html
The practice data I used for this session is available here http://duke.box.com/refine







