Discover Top Posts Tagged with #infobright


Alex Pinkin describes the difference a column store, Infobright, made to solving their problems implementing dashboards, reports, and alerts:

What is the secret sauce in Infobright? First, its column oriented storage model which leads to smaller disk I/O. Second, its “knowledge grid” which is aggregate data Infobright calculates during data loading. Data is stored in 65K Data Packs. Data Pack nodes in the knowledge grid contain a set of statistics about the data that is stored in each of the Data Packs. For instance, Infobright can pre-calculate min, max, and avg value for each column in the pack during the load, as well as keep track of distinct values for columns with low cardinality. Such metadata can really help when executing a query since it’s possible to ignore data packs which have no data matching filter criteria. If a data pack can be ignored, there is no penalty associated with decompressing the data pack.

Compared to our MySQL implementation, Infobright eliminated the need to create and manage indexes, as well as to partition tables.
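
As a rough illustration of the pack skipping described in the quote above, here is a minimal Ruby sketch. It is not Infobright's implementation: the `PackNode` struct, the tiny pack size, and the method names are hypothetical, and only min and max are tracked.

```ruby
# Hypothetical, simplified data pack node: only min and max are kept.
PackNode = Struct.new(:min, :max)

# Tiny pack size for illustration; the quoted post describes 65K-value packs.
PACK_SIZE = 4

# Compute one node per pack while "loading" a column.
def build_nodes(column, pack_size = PACK_SIZE)
  column.each_slice(pack_size).map { |pack| PackNode.new(pack.min, pack.max) }
end

# Indexes of the packs whose [min, max] range overlaps low..high.
# Every other pack can be skipped without being decompressed.
def candidate_packs(nodes, low, high)
  nodes.each_index.select { |i| nodes[i].max >= low && nodes[i].min <= high }
end

column = [1, 3, 2, 4, 10, 12, 11, 15, 20, 25, 22, 21]
nodes  = build_nodes(column)
hits   = candidate_packs(nodes, 11, 14)
puts "packs to decompress: #{hits.inspect}"
```

With this sample column, a filter on values between 11 and 14 only needs the second pack. The first and third packs are ruled out by their metadata alone.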

Original title and link: An Infobright Column Store Use Case (NoSQL database©myNoSQL)

#Infobright #column store #Powered by NoSQL

So there you have the two approaches to handling machine-generated data. If you have vast archives, EMC, IBM Netezza, and Teradata all have purpose-built appliances that scale into the petabytes. You also could use Hadoop, which promises much lower cost, but you’ll have to develop separate processes and applications for that environment. You’ll also have to establish or outsource expertise on Hadoop deployment, management, and data processing. For fast-query needs, EMC, IBM Netezza, and Teradata all have fast, standard appliances and faster, high-performance appliances (and companies including Kognitio and Oracle have similar configuration choices). Column-oriented database and appliance vendors including HP Vertica, Infobright, ParAccel, and Sybase have speed advantages inherent in their database architectures.

I’m wondering why Hadoop is mentioned just in passing considering how many large datasets it is already handling.

Original title and link: 2 Ways to Tackle Really Big Data (NoSQL database©myNoSQL)

#BigData #Netezza #Infobright #EMC #Teradata

Columnar DBMS Vendor Customer Metrics

Curt Monash published very interesting customer base numbers for Sybase IQ, Vertica, SAND Technology, and Infobright. Most are in the hundreds, except for Sybase IQ.

This got me thinking about what numbers NoSQL companies would have. Are any of them sharing such numbers? I’d speculate that most of them are in the tens, with 10gen (MongoDB) leading the space with probably a couple of hundred at best.

Original title and link: Columnar DBMS Vendor Customer Metrics (NoSQL database©myNoSQL)

#Sybase IQ #Vertica #SAND Technology #Infobright #ParAccel #columnar database

Very interesting idea in the latest Infobright release:

The most interesting of the group might be Rough Query, which speeds the process of finding the needle in a multi-terabyte haystack by quickly pointing users to a relevant range of data, at which point they can drill down with more-complex queries. So, in theory, a query that might have taken 20 minutes before might now take just a few minutes because Rough Query works in seconds by using only the in-memory data and the subsequent search is against a much smaller data set.

Curt Monash provides more context about Rough Queries in his post:

To understand Infobright Rough Query, recall the essence of Infobright’s architecture:

Infobright’s core technical idea is to chop columns of data into 64K chunks, called data packs, and then store concise information about what’s in the packs. The more basic information is stored in data pack nodes,* one per data pack. If you’re familiar with Netezza zone maps, data pack nodes sound like zone maps on steroids. They store maximum values, minimum values, and (where meaningful) aggregates, and also encode information as to which intervals between the min and max values do or don’t contain actual data values.

I.e., a concise, imprecise representation of the database is always kept in RAM, in something Infobright calls the “Knowledge Grid.” Rough Query estimates query results based solely on the information in the Knowledge Grid — i.e., Rough Query always executes against information that’s already in RAM.

Rough Query is not meant for BI or reporting, but rather for initial investigations data scientists would perform against BigData.
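
To make the idea concrete, here is a small Ruby sketch of answering a count query from per-pack metadata alone. It is an assumption-laden illustration, not Infobright's algorithm: `RoughNode`, `rough_count`, and the sample data are hypothetical, and only min, max, and row count are used.

```ruby
# Hypothetical per-pack metadata kept in memory: min, max, and row count.
RoughNode = Struct.new(:min, :max, :rows)

# Returns [lower, upper] bounds for
#   SELECT COUNT(*) ... WHERE value BETWEEN low AND high
# without touching the packs themselves.
def rough_count(nodes, low, high)
  lower = 0
  upper = 0
  nodes.each do |node|
    # Pack range lies entirely outside the filter: contributes nothing.
    next if node.max < low || node.min > high

    # Pack range lies entirely inside the filter: every row matches.
    lower += node.rows if node.min >= low && node.max <= high

    # Any overlapping pack may contribute up to all of its rows.
    upper += node.rows
  end
  [lower, upper]
end

nodes = [
  RoughNode.new(1, 4, 4),
  RoughNode.new(10, 15, 4),
  RoughNode.new(20, 25, 4)
]
bounds = rough_count(nodes, 10, 22)
puts "between #{bounds[0]} and #{bounds[1]} matching rows"
```

The gap between the two bounds comes from packs that only partly overlap the filter. Those are the packs a follow-up exact query would have to read.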

Original title and link: Infobright Rough Query: Approximating Query Results (NoSQL database©myNoSQL)

#Infobright #analytic database #column-oriented database #data warehouse

Curt Monash:

We’ll know they’re even more serious if they buy MySQL enhancements such as Infobright, dbShards, or Schooner MySQL

Why?

Original title and link: Oracle and MySQL Future (NoSQL databases © myNoSQL)

#MySQL #Oracle #dbShards #Infobright #Schooner MySQL

Using Redis to Manage Surrogate Keys

In the ad-tech industry, we get a lot of traffic. One company I work with receives upwards of 3 billion events per month. The big guys do around 20 billion per day. To manage this amount of traffic, people employ a lot of different data warehouse techniques to get every last bit of speed out of their apps while preserving as much precious disk space as possible. While natural keys carry an automatic sense of referential integrity, storing them is not very efficient. Surrogate keys, on the other hand, can be expensive to create and manage, and a lot of data warehouse solutions don't offer auto-increment or even constraints. Such is the case with Infobright.

The problem(s): Creating a surrogate key requires searching on the column values in a particular dimension table to find out whether a row exists, then, if it does not, creating a new key that is one more than the last one created.

While doing a lookup for the keys, you will most likely hinder performance and/or lock tables.

Every RDBMS data warehouse solution is much faster at bulk loading than at individual inserts.

My solution: The idea is to use Redis in your ETL process to take care of the heavy lifting. Imagine you have one fact table called event_facts and one dimension table called users_dim.

Here is what one row in our PSV file may look like:

So, from this we know that John Doe purchased socks at the price of 37.50. Split up to fit our table schema, the data may look like this:

event_facts: purchase, socks, 37.50, {user_id}
users_dim: {user_id}, John, Doe, male

Cue Redis. Here I have made a simple little Ruby script that will get us our surrogate key:

So in this example, we are accomplishing the following process:

Creating a hash of the data ("john,doe,male")

Checking Redis to see if we have the key "users_dim/#{hash}" and returning it if we do

If we do not have the key, incrementing the key "users_dim/key", which stores our latest value

Setting the key "users_dim/#{hash}" to the incremented value, so next time we will have the key
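
The steps above can be sketched in Ruby roughly as follows. This is an illustrative reconstruction, not the original script: the `surrogate_key` method, the choice of MD5, and the `FakeRedis` class are assumptions. `FakeRedis` is an in-memory stand-in so the sketch runs without a server; it exposes only `get`, `set`, and `incr`, which a real Redis client object would be expected to provide as well.

```ruby
require 'digest'

# Hypothetical in-memory stand-in for a Redis client (GET, SET, INCR only).
class FakeRedis
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  def set(key, value)
    @data[key] = value.to_s
    'OK'
  end

  def incr(key)
    @data[key] = (@data[key].to_i + 1).to_s
    @data[key].to_i
  end
end

# Returns the surrogate key for a dimension row, creating one if needed.
def surrogate_key(redis, table, *values)
  # 1. Hash the normalized row data, e.g. "john,doe,male" (MD5 is an assumption).
  hash   = Digest::MD5.hexdigest(values.map { |v| v.to_s.downcase }.join(','))
  lookup = "#{table}/#{hash}"

  # 2. Return the stored key if this row has been seen before.
  existing = redis.get(lookup)
  return existing.to_i if existing

  # 3. Otherwise bump the counter that holds the latest value...
  id = redis.incr("#{table}/key")

  # 4. ...and remember it under the row hash for next time.
  redis.set(lookup, id)
  id
end

redis = FakeRedis.new
john  = surrogate_key(redis, 'users_dim', 'John', 'Doe', 'male')
jane  = surrogate_key(redis, 'users_dim', 'Jane', 'Doe', 'female')
again = surrogate_key(redis, 'users_dim', 'John', 'Doe', 'male')
puts [john, jane, again].inspect
```

Note that the check in step 2 and the writes in steps 3 and 4 are not atomic. With several loaders running in parallel, two of them could assign different keys to the same row, so a single ETL process (or an atomic set-if-absent) is assumed here.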

On a 32 GB machine, we had well over 40 million keys, with room to spare. Because Redis is an in-memory store, it is insanely fast at these lookups, much faster than the database itself. After you get all of your surrogate keys, just output the file in a format that can be loaded into the database and go.

#redis #infobright #surrogate key #natural key #data warehouse