Discover Top Posts Tagged with #datasketches

Apache DataSketches in BigQuery: Quick Analytics at Scale

Fast, approximate, large-scale analytics: BigQuery now supports Apache DataSketches.

Understanding large datasets often requires complex, non-additive aggregations such as distinct counts and quantiles. As data grows, conventional exact methods become computationally expensive and slow. Apache DataSketches can help. Its functions are now available in BigQuery, providing powerful tools for approximate analytics at scale.

What Is Apache DataSketches?

Apache DataSketches is an open-source software library of sketches: probabilistic data structures, also known as streaming algorithms, that summarise large datasets efficiently. It is described as a "required toolkit" for systems that must extract useful information from massive amounts of data. Yahoo started working on the project in 2011, released it in 2015, and still uses it.

Essential Features and Goals:

Apache DataSketches aims to provide fast, approximate analytics on massive datasets. Conventional exact approaches to count-distinct, quantile, and most-frequent-item queries take a lot of time and computational resources, especially when the data is too large to fit in random-access memory.

DataSketches helps users quickly extract knowledge from enormous datasets, especially when exact computations are not feasible. If approximate results are acceptable, sketches can produce them orders of magnitude faster. For interactive, real-time queries, sketches may be the only viable option.

How it works:

Sketches summarise big data compactly. They typically need only one pass over the data and have low memory and computational cost. These small probabilistic data structures still yield accurate estimates.

Mergeability is essential: sketches can be merged, which makes them additive and parallelizable. Sketches built over many separate datasets can be combined for further analysis. Together, compact size and mergeability can speed up computation by orders of magnitude compared with conventional approaches.
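To make mergeability concrete, here is a minimal sketch in pure Python of a K-Minimum-Values (KMV) distinct counter. It is a toy illustration of the idea, not the DataSketches implementation: the class name `KMVSketch`, the helper `hash01`, the choice of SHA-256, and the default `k = 1024` are all assumptions made for this example.

```python
import bisect
import hashlib


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) using a stable hash."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Toy K-Minimum-Values sketch: keeps only the k smallest hash values."""

    def __init__(self, k: int = 1024):
        self.k = k
        self.hashes: list[float] = []  # kept sorted, at most k entries

    def update(self, item: str) -> None:
        h = hash01(item)
        i = bisect.bisect_left(self.hashes, h)
        if i >= self.k:
            return  # larger than every value we retain
        if i < len(self.hashes) and self.hashes[i] == h:
            return  # duplicate item, already represented
        self.hashes.insert(i, h)
        if len(self.hashes) > self.k:
            self.hashes.pop()  # drop the largest to stay at k entries

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        """The k smallest hashes of the union summarise the combined data."""
        merged = KMVSketch(min(self.k, other.k))
        merged.hashes = sorted(set(self.hashes) | set(other.hashes))[: merged.k]
        return merged

    def estimate(self) -> float:
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # fewer than k distinct items: exact
        return (self.k - 1) / self.hashes[-1]


# Two overlapping user populations, sketched independently (hypothetical data).
east, west = KMVSketch(), KMVSketch()
for i in range(60_000):
    east.update(f"user-{i}")
for i in range(40_000, 100_000):
    west.update(f"user-{i}")

combined = east.merge(west)
print(round(combined.estimate()))  # roughly 100,000 distinct users
```

The merged sketch is identical to the one a single pass over all the data would have produced, which is what lets the work be split across machines or time periods. Each sketch stores at most 1,024 numbers regardless of how many users it has seen.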

Important features and benefits include:

Fast: Sketches process data in a single pass, in both batch and real-time settings. Data sketching can reduce big data processing times from days to minutes or seconds.

Efficiency: Sketches have low memory and computational overhead. They save resources by reducing query and storage costs compared with raw data. Systems designed around sketches have simpler architectures and use less computing power.

Accuracy: Sketches accurately approximate histograms, quantiles, and distinct counts. All but a few sketches come with mathematically specified error bounds, which bound the difference between the true value and its estimate. The user can tune the trade-off between sketch size and error: larger sketches have smaller error bounds.

Scalability: The library is designed for production systems that handle large data. It helps analyse volumes of data too large to fit in random-access memory.

Interoperability: Thanks to their explicitly defined binary representations, sketches can be transported between systems and interpreted by Java, C++, and Python without losing accuracy.

Set operations: Theta Sketch's built-in set operators (Union, Intersection, and Difference) enable set expressions such as ((A ∪ B) ∩ (C ∪ D)) \ (E ∪ F) whose results are themselves sketches. This gives fast queries an unusually rich set of analytical options.
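The set expression above can be illustrated with a simplified, theta-style sketch in pure Python. This is a teaching sketch under stated assumptions, not the library's Theta Sketch: `ThetaSketch`, `build`, `hash01`, and the default `k = 4096` are hypothetical names and values, and `build` hashes a whole collection at once for brevity instead of streaming.

```python
import hashlib
from dataclasses import dataclass


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) using a stable hash."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


@dataclass(frozen=True)
class ThetaSketch:
    theta: float       # sampling threshold in (0, 1]
    hashes: frozenset  # retained hash values, all below theta


def build(items, k: int = 4096) -> ThetaSketch:
    """Keep the k smallest distinct hashes; theta is the (k+1)-th smallest."""
    distinct = sorted({hash01(x) for x in items})
    if len(distinct) <= k:
        return ThetaSketch(1.0, frozenset(distinct))  # small input: exact
    return ThetaSketch(distinct[k], frozenset(distinct[:k]))


def _combine(a: ThetaSketch, b: ThetaSketch, keep) -> ThetaSketch:
    # Below the smaller threshold, both sketches hold complete information,
    # so membership tests on the retained hashes are reliable there.
    theta = min(a.theta, b.theta)
    kept = frozenset(
        h for h in a.hashes | b.hashes
        if h < theta and keep(h in a.hashes, h in b.hashes)
    )
    return ThetaSketch(theta, kept)


def union(a, b):
    return _combine(a, b, lambda in_a, in_b: in_a or in_b)


def intersection(a, b):
    return _combine(a, b, lambda in_a, in_b: in_a and in_b)


def difference(a, b):
    return _combine(a, b, lambda in_a, in_b: in_a and not in_b)


def estimate(s: ThetaSketch) -> float:
    return len(s.hashes) / s.theta


def users(lo: int, hi: int):
    return [f"user-{i}" for i in range(lo, hi)]


# Hypothetical populations; the true answer is users 30,000 to 44,999.
A, B = build(users(0, 30_000)), build(users(20_000, 50_000))
C, D = build(users(25_000, 60_000)), build(users(55_000, 80_000))
E, F = build(users(25_000, 30_000)), build(users(45_000, 50_000))

result = difference(intersection(union(A, B), union(C, D)), union(E, F))
print(round(estimate(result)))  # roughly 15,000
```

Every intermediate result is again a `ThetaSketch`, so expressions can be nested freely. The accuracy of intersections and differences depends on the smallest threshold involved, so results that are small relative to their inputs carry a larger relative error.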

Important Sketch Types (BigQuery-Integrated Examples):

The library contains several types of analytical sketches:

Cardinality sketches: These estimate distinct counts. They include Theta Sketch for distinct counting and set expressions, HyperLogLog (HLL) Sketch for simple distinct counting, CPC Sketch for the best accuracy per stored size, and Tuple Sketch, which builds on Theta Sketch to associate additional values with distinct items for more complex analysis.

Quantile sketches: These estimate values at given percentiles or ranks, such as the median. REQ Sketch is designed for higher accuracy at the ends of the rank domain; KLL Sketch is known for statistically optimal quantile approximation accuracy for a given size and for insensitivity to the input data distribution; and T-Digest Sketch is a fast, compact heuristic sketch (without mathematically proven error bounds) for strictly numeric data.

Frequency sketches: These identify items that occur more often than a threshold. The Frequent Items Sketch, also known as the Heavy-Hitter sketch, can detect frequent items in one pass, for static analysis or real-time monitoring.

Apache DataSketches is a strong collection of specialised algorithms that enable fast approximate analysis, with known accuracy, on massive datasets in big data environments such as cloud platforms like Google Cloud BigQuery.
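As a rough illustration of one-pass heavy-hitter detection, the following uses the classic Misra-Gries algorithm in plain Python. It is a textbook technique chosen for this example and is not presented as the library's Frequent Items Sketch; the function name `misra_gries` and the page-view stream are hypothetical.

```python
import random


def misra_gries(stream, k: int) -> dict:
    """One-pass heavy-hitter candidates using at most k - 1 counters.

    For a stream of length n, every item occurring more than n / k times
    is guaranteed to appear in the result. Each reported count is at most
    the true count and undercounts it by no more than n / k.
    """
    counters: dict = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical page-view stream: two popular pages and 200 one-off pages.
stream = ["home"] * 500 + ["search"] * 300 + [f"page-{i}" for i in range(200)]
random.Random(7).shuffle(stream)

heavy = misra_gries(stream, k=5)  # threshold is n / k = 200 views
print(sorted(heavy, key=heavy.get, reverse=True)[:2])
```

Memory stays bounded at `k - 1` counters however many distinct pages appear. The result may also contain a few infrequent items, so a second pass or a count threshold is a common way to filter the candidates.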

Combining Druid and DataSketches for Real-time, Robust Behavioral Analytics

By Himanshu Gupta

Millions of users around the world interact with Yahoo through their web browsers and mobile devices, generating billions of events every day (e.g. clicking on ads, clicking on various pages of interest, and logging in). As Yahoo's data grows larger and more complex, we are investing in new ways to better manage and make sense of it. Behavioral analytics is one important branch of analytics in which we are making significant advancements, and is helping us accomplish these tasks.

Beyond simply measuring how many times a user has performed a certain action, we also try to understand patterns in their actions. We do this in order to help us decide which of our features are impactful and might grow our user base, and to understand responses to ads that might help us improve users’ future experiences.

One example of behavioral analytics is measuring user retention rates for Yahoo properties such as Mail, News, and Finance, and breaking down these rates by different user demographics. Another example is to determine which ads perform well for various types of users (as measured by various signals), and to serve ads appropriately based on that implicit or explicit feedback.

The challenges we face in answering these questions mainly concern storing and interactively querying our user-generated events at massive scale. We make heavy use of distributed systems, and Druid is at the forefront of powering most of our real-time analytics at scale.

One of the features that makes Druid very useful is the ability to summarize data at storage time. This leads to greatly-reduced storage requirements, and hence, faster queries. For example, consider the dataset below:

This data represents ad clicks for different website domains. We can see that there are many repeated attributes, which we call “dimensions,” in our data across different timestamps. Now, most of the time we don’t care that a certain ad was clicked at a precise millisecond in time. What is a lot more interesting to us is how many times an ad was clicked over the period of an hour. Thus, we can truncate the raw event timestamps and group all events with the same set of dimensions. When we group the dimensions, we also aggregate the raw event values for the “clicked” column.

This method is known as summarization, and in practice, we see summarization significantly reduce the amount of raw data we have to store. We’ve chosen to lose some information about the time an event occurred, but there is no loss of fidelity for the “clicked” metric that we really care about.
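The summarization step can be sketched in a few lines of Python. The events below are invented purely for illustration (they are not the dataset referred to in this post), and `roll_up` is a hypothetical helper, not a Druid API.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw click events: (ISO timestamp, domain, clicked).
raw_events = [
    ("2015-09-01T01:01:35", "example.com", 1),
    ("2015-09-01T01:03:12", "example.com", 0),
    ("2015-09-01T01:47:02", "example.com", 1),
    ("2015-09-01T01:15:51", "example.org", 1),
    ("2015-09-01T02:00:08", "example.org", 1),
]


def roll_up(events):
    """Truncate timestamps to the hour and sum 'clicked' per (hour, domain)."""
    totals = Counter()
    for timestamp, domain, clicked in events:
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        totals[(hour.isoformat(), domain)] += clicked
    return dict(totals)


summary = roll_up(raw_events)
for (hour, domain), clicks in sorted(summary.items()):
    print(hour, domain, clicks)
```

Five raw rows collapse into three summarized rows, and the hourly click totals are exactly what they would be if computed from the raw events.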

Let’s consider the same dataset again, but now with information about which user performed the click. When we go to summarize our data, the high-cardinality “user-id” column, whose values are mostly unique, prevents our data from compacting very well.

The number of unique user-ids could be very high due to the number of users visiting Yahoo every day. So, in our “user-id” column, we end up effectively storing our raw data. Given that we are mostly interested in how many unique users performed certain actions, and we don’t really care about precisely which users did those actions, it would be nice if we could somehow lose some information about the individual users so that our data could still be summarized.

One approach to solving this problem is to create a “sketch” of the user-id dimension. Instead of storing every single unique user-id, we instead maintain a hash-based data structure – also known as a sketch – which has smaller storage requirements and gives estimates of user-id dimension cardinality with predictable accuracy.

Leveraging sketches, our summarized data for the user dimension looks something like this:

Sketch algorithms are highly desirable because they are very scalable, use predictable storage, work with real-time streams of data, and provide predictable estimates. There are many different algorithms for constructing different types of sketches, and a lot of fancy mathematics goes into explaining how sketch algorithms work and why they give very good estimates.
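A rough Python illustration of storing a sketch per summarized row is shown below. It uses a toy K-Minimum-Values sketch as a stand-in, which is an assumption made for this example and not the algorithm DataSketches or Druid actually uses. The events, the constant `K`, and the helper names are all hypothetical.

```python
import bisect
import hashlib
from collections import defaultdict

K = 1024  # hashes kept per sketch (assumed value for this example)


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) using a stable hash."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def update(sketch: list, item: str) -> None:
    """Insert the item's hash if it is new and among the K smallest."""
    h = hash01(item)
    i = bisect.bisect_left(sketch, h)
    if i < K and (i == len(sketch) or sketch[i] != h):
        sketch.insert(i, h)
        del sketch[K:]  # keep at most K values


def merge(a: list, b: list) -> list:
    return sorted(set(a) | set(b))[:K]


def estimate(sketch: list) -> float:
    if len(sketch) < K:
        return float(len(sketch))  # fewer than K distinct users: exact
    return (K - 1) / sketch[-1]


# Hypothetical events: (hour, domain, user_id). Users 2000-2999 visit twice.
events = [("01:00", "example.com", f"user-{i}") for i in range(3_000)]
events += [("02:00", "example.com", f"user-{i}") for i in range(2_000, 5_000)]

# Storage time: one fixed-size sketch per (hour, domain) row.
rows = defaultdict(list)
for hour, domain, user_id in events:
    update(rows[(hour, domain)], user_id)

# Query time: merge the row sketches instead of adding their counts.
per_hour = {key: estimate(sketch) for key, sketch in rows.items()}
both_hours = estimate(
    merge(rows[("01:00", "example.com")], rows[("02:00", "example.com")])
)
print({k: round(v) for k, v in per_hour.items()}, round(both_hours))
```

Each row holds at most `K` numbers no matter how many users it covers. Merging the two hourly sketches gives roughly 5,000 unique users, whereas adding the two hourly estimates would count the returning users twice.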

At Yahoo, we recently developed an open source library called DataSketches. DataSketches provides implementations of various approximate sketch-based algorithms that enable faster, cheaper analytics on large datasets. By combining DataSketches with an extremely low-latency data store, such as Druid, you bring sketches into practical use in a big data store. Embedding sketch algorithms in a data store and persisting the actual sketches is relatively novel in the industry, and is the future structure of big data analytics systems.

Druid’s flexible plugin architecture allows us to integrate it with DataSketches; as such, we’ve developed and open sourced an extension to Druid that allows DataSketches to be used as a Druid aggregation function. Druid applies the aggregation function on selected columns and stores aggregated values instead of raw data.

By leveraging the fast, approximate calculations of DataSketches, complex analytic queries such as cardinality estimation and retention analysis can be completed in less than one second in Druid. This allows developers to visualize the results in real time and to slice and dice them across a variety of different filters. For example, we can quickly determine how many users visited our core products, including Yahoo News, Sports, and Finance, as well as see how many of those users returned some time later. We can also break down our results in real time based on user demographics such as age and location.

If you have similar use cases to ours, we invite you to try out DataSketches and Druid for behavioral analytics. For more information about DataSketches, please visit the DataSketches website. For more information about Druid, please visit the project webpage. And finally, documents for the DataSketches and Druid integration can be found in the Druid docs.