Business Intelligence Shared @ryanshirley - Tumblr Blog

Hey! would you ever sell the domain?

Please contact me directly @ryantshirley on twitter.

Loading millions of XML files into Hive

A few lessons learned while loading a large number (4.5MM) of XML files into a Hive table.

1) Unzipping the archive that contains all the small XML files can be tricky. The Linux unzip command might fail with the “End-of-central-directory signature not found” error. This does NOT necessarily mean that your zip file is corrupt. If your zip file is very large (over 6GB in my case), the unzip command does not seem to be able to read far enough into it to find the central directory. I found this great solution on AskUbuntu: just use jar xvf <zip filename>. This command does the same job as unzip, but does not seem to care about finding the central directory. It just starts churning through files.

2) To work with very large numbers of files in a directory, you sometimes need to use the find command. For example, if you want to change permissions on 4.5MM files, the chmod 666 * command will fail with “argument list too long”. Instead, try using find and pipe the output to the chmod command with the use of the xargs command.

find . -name "*.xml" -print0 | xargs -0 chmod 666
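If you would rather do the same thing from a script, here is a minimal Python sketch of the idea. It walks the directory tree and changes files one at a time, so no long argument list is ever built. The function name chmod_matching and its defaults are my own illustration, not part of any tool mentioned above.

```python
import os


def chmod_matching(root, suffix=".xml", mode=0o666):
    """Set `mode` on every file under `root` whose name ends with `suffix`.

    Files are handled one at a time, so the shell's argument-length
    limit never comes into play. Returns the number of files changed.
    """
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                os.chmod(os.path.join(dirpath, name), mode)
                changed += 1
    return changed


if __name__ == "__main__":
    # Operates on the current directory, like `find .` in the command above.
    print(chmod_matching("."), "files updated")
```

For a one-off job the find and xargs pipeline is still the quicker option; the script is mainly useful if you need to log or count what was changed.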

3) If you simply put the XML files into HDFS and then point an EXTERNAL table at the HDFS folder, you will get nothing from the table if *any* of your XML files are invalid. With this many files, it can be difficult to find the culprit files, so an alternate approach was needed. I used Apache NiFi to read the files from the local filesystem and then used the “EvaluateXPath” processor to pick out the elements of the XML file I needed.

EvaluateXPath put the elements into attributes on the flowfiles and then I was able to use ReplaceText to make the content of the flowfiles be a comma separated list of the elements. From there, I needed only to write them to an HDFS folder that was the source of an EXTERNAL table in Hive.

Watch out for multiple matches coming from the EvaluateXPath processor. You can filter these out by doing a SELECT DISTINCT from your EXTERNAL table at the end.
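To make the shape of that flow concrete, here is a rough Python sketch of the same idea outside NiFi: parse each XML document, skip the ones that are not well formed, pull out a few elements and emit one comma separated line per file. This is not how NiFi does it internally. The field names in FIELDS are hypothetical placeholders, and the sketch assumes the elements are not namespaced.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical element names; substitute the elements you actually need.
FIELDS = ("id", "name", "created")


def xml_to_row(xml_text, fields=FIELDS):
    """Return one value per field, or None if the XML is not well formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    # findtext() returns only the first match, so a repeated element
    # yields a single value rather than several rows.
    return [(root.findtext(f".//{field}") or "").strip() for field in fields]


def to_csv(documents, fields=FIELDS):
    """Convert XML strings to CSV text. Returns (csv_text, bad_count)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    bad = 0
    for doc in documents:
        row = xml_to_row(doc, fields)
        if row is None:
            bad += 1  # invalid file: count it and move on
            continue
        writer.writerow(row)
    return out.getvalue(), bad


if __name__ == "__main__":
    docs = ["<r><id>1</id><name>a</name></r>", "<r><id>2</id>"]
    text, bad = to_csv(docs)
    print(text, "invalid documents:", bad)
```

The point of the design is the same as in the NiFi flow: one bad file costs you one row, not the whole table.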

Hi Ryan, did you ever live in Valley Ridge in Calgary? I have a bunch of mail for you. Thx. Scott H

Hey Scott! Yes, you’ve got the right Ryan. Sorry, I didn’t see this until now. If you still have the mail, I will come by some evening this week. If not, then no big deal. Hope all is well.

Odd behaviour when running Hive queries with ORDER BY using different drivers

When running a recent query with ORDER BY in it, we noticed very odd behaviour depending on which tool we were running it in. When using Ambari Hive View or Toad Data Point, the query ran fine. When running in SQuirreL or Tableau Desktop, the query failed with an error:

"org.apache.hive.service.cli.HiveSQLException:Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:71975 Invalid table alias or column reference '_SYNTHJOIN_7a854c02___genesys_wfm_wfm_sitewfm_site_name_': (possible c"

After a bunch of trial and error, and looking at the Hive Wiki, we tried swapping out the ORDER BY for SORT BY. That seemed to improve things, but it still choked on one of the columns. That error was more straightforward. We just needed to make sure that all the columns in the SORT BY were also in the SELECT list.

We are using a Hortonworks JDBC driver for SQuirreL, and Tableau Desktop probably uses the same driver or its bundled Hortonworks Hive driver. Both seem to be a little behind the times and appear to accept only the SORT BY syntax. Our version of Hive, 1.2.1, supports ORDER BY, but the driver does not seem to know that.

#apache hive #tableau #squirrel #jdbc

AWS Elastic Beanstalk vs EC2

I’ve been on Beanstalk for about 3 years building a couple of web apps (http://rentersEdge.ca for example). I also have a single EC2 instance I use for ancillary work. I can mostly speak to Beanstalk because I haven’t run my apps off of plain EC2.

Beanstalk has a great little command line tool for doing deployments. It looks at what you last committed with git and updates the right files on your remote servers. It also does a few things to get you up and running more quickly, like installing the web server on one machine and the database on another and letting them talk to each other.

That said, I think I made a mistake in going straight to Beanstalk for my small, MVP web app. Having everything on a single EC2 VM would have saved me a bit of hassle and probably some money. Disadvantages of Beanstalk:

First, doing custom config is hard - especially creating a custom httpd.conf for your app. There seems to be no easy way despite much research. Also watch out - if you do ANY custom config by SSH-ing into your server, it is considered “out of band” and gets wiped out whenever your server is replaced. Servers can get replaced at random times if AWS thinks they are out of date or unhealthy, or if you turn them off by accident.

Second, with an EC2 instance, you can turn it off and on at any time and save money. You can have everything on one and save money too. For large operations, this won’t matter, but for a bootstrapped startup, this makes a difference.

Third, I find it a bit harder to give access to Beanstalk to 3rd parties. With a simple vm, you just give them an ftp account. For beanstalk, you have to figure out IAM roles and get them to install the beanstalk CLI client.

Beanstalk is a good product and really a good fit if you know your service is going to grow. You get load balancers and auto scaling configured automatically, which is way out of my domain. But if you are just doing a POC or an MVP and want to see if it works, and you have some experience configuring things yourself, then consider a single EC2 instance.

For all you Mac users still on Snow Leopard (10.6) for whatever reason, I have gone through the pain of figuring out which is the latest version of MySQL Workbench you can use. You may have found that the very latest and greatest (6.3 at this writing) needs 10.8. If you installed it on Snow Leopard, you are now stuck searching through blogs and docs to find out which is the latest version that works with your OS. Finally, I found the answer on the WB blog. The answer appears to be: MySQL Workbench 6.1.7 GA. When they went to 6.2, they started requiring OS X 10.7 or later. I'm about to d/l and install, so if something goes wrong, I'll be revising this post :)

Integrating healthcare systems now requires practitioners to be aware of the ReST vs SOAP debate. Based on anecdotal evidence, it would seem SOAP was the de facto protocol for encoding messages between healthcare apps until the recent increase in the use of ReST web services. The world of web and app development has grown up around using ReST to move data. As web based services have grown in popularity and the tooling for building and using them has become more ubiquitous and developer friendly, ReST has begun to gain a foothold in healthcare system integration. Here are a few points I noted after some brief reading on ReST vs SOAP on the web.

SOAP

- A protocol, based on XML, used for Remote Procedure Call (RPC)

- Simply moving data (patient records etc.) is a subset of RPC and is achievable with SOAP

- Heavy weight protocol because of the XML “envelope” that must wrap every message

- More common, and still well suited for App to App, App to Server, and Server to Server communication

- May ease programming because of things like the WSDL that describes the implementation of the protocol and allows developers to easily construct objects from messages

- OO languages have libraries to consume SOAP messages and build objects

- Seems to necessitate reinvention of each verb that acts on a service, whereas the verbs are standardized in ReST (GET, PUT, DELETE etc)

ReST

- An Architecture, based on HTTP

- Leverages the HTTP CRUD verbs (GET, PUT, DELETE, …)

- Enables easier caching because the verbs are already known

- How the data is sent is not standardized. Can be HTML, XML, JSON, etc.

- Lightweight because no wrapper
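To make the “envelope” point concrete, here is a small Python sketch that builds the same hypothetical request both ways. The GetPatient operation, the patients resource and the id 42 are made up for illustration. The envelope namespace is the standard SOAP 1.1 one.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
ET.register_namespace("soap", SOAP_NS)


def soap_message(operation, params):
    """Wrap an operation call in a SOAP Envelope/Body and return the XML."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, operation)  # the service-specific "verb"
    for key, value in params.items():
        ET.SubElement(call, key).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def rest_request(resource, resource_id):
    """Describe the equivalent ReST call: a standard HTTP verb plus a URL."""
    return {"method": "GET", "url": f"/{resource}/{resource_id}"}


if __name__ == "__main__":
    # Hypothetical operation and resource names.
    print(soap_message("GetPatient", {"id": 42}))
    print(rest_request("patients", 42))
```

In the SOAP version the verb (GetPatient) is invented by the service and travels inside the wrapper. In the ReST version the verb is plain HTTP GET and the resource is identified by the URL.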

Thanks to several StackOverflow posts and other blogs for this info.

#SOAP #ReST #web services

Materialized Views - Oracle and SQL Server

This is one of those “I’m blogging again” posts. Like so many others, I let my blog sit idle for quite a while and now I’m writing again. I’m not going to make any promises about future post frequency, but I will say that this quick notes sheet I wrote really helped cement a couple concepts in my mind and I plan on doing more of them. When I do, I’ll share them.

For anyone who has wondered what a materialized view is and whether they are different on two of the most popular DBMS, here’s a quick rundown. In particular, this page should answer the question “how are materialized views refreshed on Oracle and SQL Server?”

*Oracle Materialized Views:*

- For replicating data to non-master sites in a replication environment

- For caching expensive DW queries

- Replica of a table or view taken from a single point in time (*snapshot*)

- “refresh” of the mat. view is done intelligently - the “fast refresh” checks for changes in the master and does nothing if there were no changes

- Reason for use:

o Ease network load

§ Instead of all offices accessing the central db, replicate some data to remote sites

§ Less network use than “multi-master” replication, which is real-time and transactional, requiring a dedicated network link

§ Data subsetting can reduce network load

o Mass deployment environment

§ Create a mat. view environment locally

§ Deploy to many users

o Data subsetting

§ Security, performance…

o Disconnected Computing

§ Sales person in field can refresh when they have a connection

- Different types of mat. views:

o Read only

§ No DML allowed on the view

o Updateable

§ DML allowed on view

§ Changes made on a remote site are pushed back to the master

§ Must belong to a “materialized view group”

o Writeable

§ DML allowed

§ Changes are not pushed back to server - changes are lost when view is refreshed

*SQL Server Indexed View*

- Adding a unique clustered index to a view makes it a “materialized” view immediately

- The query optimizer will substitute the indexed view for a table (or other view?) in a query automatically if it will improve performance. The query does not have to be rewritten - allowing the application code to stay the same

- Can have major improvement of performance over just adding indexes to tables

o The view can do the aggregation, joins and filtering ahead of time

Q: When are indexed views refreshed?

A: They are updated at the moment the underlying tables are updated. Like an index on a table, the clustered index on the view is updated when changes to the table occur.

- The clustered index on an indexed view must be unique - this makes updates faster

- Adding a clustered index to a view makes updating the underlying tables slower and should only be done when query performance is a priority

o Use when underlying data is fairly static

o Use when the view processes many rows

o Use when the view is referenced by many queries

- The indexed view is stored in the DB the same way a clustered index would be stored
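The difference in refresh behaviour can be shown with a toy model. The Python sketch below is not database code and all the class names are my own. It only imitates the two strategies described above: an Oracle-style snapshot that catches up when you call refresh() (and does nothing if the master has not changed), and a SQL Server-style indexed view that is maintained as part of every write.

```python
class BaseTable:
    """A toy 'master' table of (region, amount) rows."""

    def __init__(self):
        self.rows = []
        self.listeners = []  # views that are maintained at write time

    def insert(self, region, amount):
        self.rows.append((region, amount))
        for view in self.listeners:
            view.apply(region, amount)  # the write pays the maintenance cost


class SnapshotView:
    """Oracle-style: totals are as of the last refresh()."""

    def __init__(self, table):
        self.table = table
        self.totals = {}
        self.seen = 0  # how many master rows have been applied so far

    def refresh(self):
        new_rows = self.table.rows[self.seen:]
        if not new_rows:
            return False  # 'fast refresh': nothing changed, nothing to do
        for region, amount in new_rows:
            self.totals[region] = self.totals.get(region, 0) + amount
        self.seen = len(self.table.rows)
        return True


class IndexedView:
    """SQL Server-style: totals are updated on every write to the table."""

    def __init__(self, table):
        self.totals = {}
        table.listeners.append(self)

    def apply(self, region, amount):
        self.totals[region] = self.totals.get(region, 0) + amount


if __name__ == "__main__":
    sales = BaseTable()
    snapshot, indexed = SnapshotView(sales), IndexedView(sales)
    sales.insert("west", 10)
    print(indexed.totals, snapshot.totals)  # indexed is current, snapshot is stale
    snapshot.refresh()
    print(snapshot.totals)
```

This also shows why an indexed view slows down writes to the underlying tables: every insert does extra work, whereas the snapshot defers that work until refresh time.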

#sql server #oracle #database

BusinessObjects Patch Levels

Here is a mapping of BusinessObjects version numbers to patch levels for XI R2, 3.1 and 4.0. Thanks to forumtopics.com and bobj.sapbiblog.com for the data.

BO XI 3.1 SP5
12.5.0.1190 --> XI 3.1 SP5

BO XI 3.1 SP4
12.4.0.966 --> XI 3.1 SP4
12.4.1.1188 --> XI 3.1 SP4 + FP 4.1

BO XI 3.1 SP3
12.3.0.601 --> XI 3.1 SP3
12.3.1.684 --> XI 3.1 + SP3 + FP3.1
12.3.2.753 --> XI 3.1 + SP3 + FP3.2
12.3.3.812 --> XI 3.1 + SP3 + FP3.3
12.3.4.895 --> XI 3.1 + SP3 + FP3.4
12.3.5.925 --> XI 3.1 + SP3 + FP3.5
12.3.6.1006 --> XI 3.1 + SP3 + FP3.6

BO XI 3.1 SP2
12.2.0.290 --> XI 3.1 + SP2
12.2.1.412 --> XI 3.1 + SP2 + FP2.1
12.2.2.441 or 12.2.2.453 --> XI 3.1 + SP2 + FP2.2
12.2.3.467 --> XI 3.1 + SP2 + FP2.3
12.2.4.507 --> XI 3.1 + SP2 + FP2.4
12.2.5.537 --> XI 3.1 + SP2 + FP2.5
12.2.6.567 --> XI 3.1 + SP2 + FP2.6
12.2.7.598 --> XI 3.1 + SP2 + FP2.7
12.2.8.638 --> XI 3.1 + SP2 + FP2.8
12.2.9.698 --> XI 3.1 + SP2 + FP2.9
12.2.10.810 --> XI 3.1 + SP2 + FP2.10

BO XI 3.1
12.1.0.882 --> XI 3.1
12.1.1.957 --> FP1.1
12.1.2.996 --> FP1.2
12.1.3.1028 --> FP1.3
12.1.4.1068 --> FP1.4
12.1.5.1096 --> FP1.5
12.1.6.1116 --> FP1.6
12.1.7.1144 --> FP1.7
12.1.8.1176 --> FP1.8
12.1.9.1290 --> FP1.9
12.1.10.1390 --> FP1.10
12.1.11.1454 --> FP1.11

BO XI 3.0
12.0.0.683.BOE_Titan_RTM --> XI 3.0
12.0.0.796.BOE_Titan_FP_RTM_REL --> FP 1
12.0.0.840.BOE_Titan_FP_RTM_REL --> FP 2
12.0.3.900 BOE_Titan_FP_RTM_REL --> FP 3

BO XI r2
11.5.0.313 --> RTM
11.5.3.417 --> SP1
11.5.7.608 --> MHF1
11.5.7.682 --> CHF13
11.5.7.723 --> CHF14
11.5.7.770 --> CHF15
11.5.7.811 --> CHF16
11.5.7.852 --> CHF17
11.5.7.1048 --> CHF20
11.5.8.826 --> SP2
11.5.8.826 --> FP2.1
11.5.8.939 --> FP2.2
11.5.8.963 --> FP2.3
11.5.8.998 --> FP2.4
11.5.8.1030 --> FP2.5
11.5.8.1061 --> FP2.6
11.5.8.1111 --> FP2.7
11.5.8.1187 --> FP2.8
11.5.8.1237 --> FP2.9
11.5.9.1076 --> SP3
11.5.9.1101 --> FP3.1
11.5.9.1138 --> FP3.2
11.5.9.1176 --> FP3.3
11.5.9.1202 --> FP3.4
11.5.9.1227 --> FP3.5
11.5.9.1265 --> FP3.6
11.5.9.1330 --> FP3.7
11.5.9.1378 --> FP3.8
11.5.9.1449 --> FP3.9
11.5.10.1263 --> SP4
11.5.10.1298 --> FP4.1
11.5.10.1322 --> FP4.2
11.5.10.1358 --> FP4.3
11.5.10.1404 --> FP4.4
11.5.10.1426 --> FP4.5
11.5.10.1460 --> FP4.6
11.5.10.1491 --> FP4.7
11.5.10.1554 --> FP4.8
11.5.10.1619 --> FP4.9
11.5.10.1681 --> FP4.10
11.5.10.1765 --> FP4.11
11.5.10.1854 --> FP4.12
11.5.11.1470 --> SP5
11.5.11.1545 --> FP 5.1
11.5.11.1578 --> FP 5.2
11.5.11.1610 --> FP 5.3
11.5.11.1639 --> FP 5.4
11.5.11.1673 --> FP 5.5
11.5.11.1732 --> FP 5.6
11.5.11.1797 --> FP 5.7
11.5.11.1856 --> FP 5.8
11.5.12.1838 --> SP6

RTM = Release To Market (first release, without any patches). MHF = Monthly Hot Fix. CHF = Critical Hot Fix. SP = Service Pack. FP = Fix Pack.

*XI 4.0*

*BI 4.0 RTM*
*14.0.0.760* --> BI 4.0 RTM
14.0.0.904 --> Patch 04
14.0.0.918 --> Patch 05
14.0.0.940 --> Patch 06
14.0.0.954 --> Patch 07
14.0.0.970 --> Patch 08
14.0.0.986 --> Patch 09
14.0.0.996 --> Patch 10

*Support Pack 1*
*14.0.1.287* --> Support Pack 1
14.0.1.313 --> Patch 1.1
14.0.1.330 --> Patch 1.2
14.0.1.342 --> Patch 1.3
14.0.1.360 --> Patch 1.4
14.0.1.375 --> Patch 1.5
14.0.1.397 --> Patch 1.6

*Support Pack 2*
*14.0.2.364* --> Support Pack 2
14.0.2.388 --> Patch 2.1
14.0.2.397 --> Patch 2.2
14.0.2.416 --> Patch 2.3
14.0.2.439 --> Patch 2.4
14.0.2.455 --> Patch 2.5
14.0.2.481 --> Patch 2.6
14.0.2.508 --> Patch 2.7
14.0.2.532 --> Patch 2.8
14.0.2.556 --> Patch 2.9
14.0.2.565 --> Patch 2.10
14.0.2.594 --> Patch 2.11
14.0.2.619 --> Patch 2.12
14.0.2.641 --> Patch 2.13
14.0.2.657 --> Patch 2.14
14.0.2.682 --> Patch 2.15
14.0.2.703 --> Patch 2.16
14.0.2.721 --> Patch 2.17
14.0.2.740 --> Patch 2.18
14.0.2.762 --> Patch 2.19
14.0.2.798 --> Patch 2.20
14.0.2.846 --> Patch 2.21

*Feature Pack 3*
*14.0.3.613* --> Feature Pack 3
14.0.3.630 --> Patch FP3.01
14.0.3.657 --> Patch FP3.02
14.0.3.678 --> Patch FP3.03
14.0.3.691 --> Patch FP3.04
14.0.3.711 --> Patch FP3.05
14.0.3.735 --> Patch FP3.06
14.0.3.757 --> Patch FP3.07
14.0.3.777 --> Patch FP3.08
14.0.3.796 --> Patch FP3.09
14.0.3.818 --> Patch FP3.10

*Support Pack 4*
*14.0.4.738* --> Support Pack 4
14.0.4.770 --> Patch 4.1
14.0.4.797 --> Patch 4.2
14.0.4.819 --> Patch 4.3
14.0.4.831 --> Patch 4.4
14.0.4.855 --> Patch 4.5
14.0.4.880 --> Patch 4.6
14.0.4.896 --> Patch 4.7

*Support Pack 5*
*14.0.5.882* --> Support Pack 5
14.0.5.915 --> Patch 5.1
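If you need to check version numbers against this list in a script, a plain dictionary lookup is enough. The Python sketch below copies only a handful of rows from the table above as a sample. The function names are my own, and the product line grouping (11.5 for XI r2, 12.0 for XI 3.0, 12.1 and later for XI 3.1, 14.0 for XI 4.0) simply follows the headings in the table.

```python
# A few sample rows copied from the table above; add the rows you need.
PATCH_LEVELS = {
    "12.5.0.1190": "XI 3.1 SP5",
    "12.4.1.1188": "XI 3.1 SP4 + FP 4.1",
    "12.1.0.882": "XI 3.1",
    "11.5.8.939": "FP2.2",
    "14.0.0.760": "BI 4.0 RTM",
    "14.0.5.915": "Patch 5.1",
}

# Product line by (major, minor) version prefix, as grouped in the table.
PRODUCT_LINES = {(11, 5): "BO XI r2", (12, 0): "BO XI 3.0", (14, 0): "XI 4.0"}


def product_line(version):
    """Return the product line implied by the first two version parts."""
    parts = version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    if major == 12 and minor >= 1:
        return "BO XI 3.1"  # 12.1 through 12.5 are all XI 3.1 in the table
    return PRODUCT_LINES.get((major, minor), "unknown")


def describe(version):
    """Return 'product line: patch level' for a version string."""
    level = PATCH_LEVELS.get(version, "patch level not in table")
    return f"{product_line(version)}: {level}"


if __name__ == "__main__":
    print(describe("12.4.1.1188"))
    print(describe("14.0.2.364"))
```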

Some encouraging news from IBM for anyone in IT, especially in the analytics field.

http://www.calgaryherald.com/business/productiveconversations/worker+shortage+serious+implications+Canada/7683651/story.html?__lsa=9c1b-d71a

In researching OBIEE, I came across this excellent scorecard set up by the New York City Mayor's Office.

http://www.nyc.gov/html/ops/cpr/html/home/home.shtml

The user experience is decent, with a fairly uncluttered and intuitive design. At the same time, there is a wealth of information that users can delve into. The integration effort to get the many indicators from what must be a great many disparate systems is impressive. The most important goal of a scorecard is to encourage accountability and provide incentive to improve performance. With simple displays that the public can see, comparing departments to each other and comparing their performance over time, I think this scorecard will accomplish that goal.

#scorecard #OBIEE #Oracle #performance management

Just a quick post on getting the Excel connection manager to work in 64 bit SSIS and 64 bit BIDS. You may have run into this error:

DTS_E_OLEDB_EXCEL_NOT_SUPPORTED

This happens when trying to use the Excel Source in 64 bit BIDS or DTEXEC. For my own reference as much as yours, here is an excellent blog post on how to fix the error.

http://blogs.msdn.com/b/priyo/archive/2009/05/04/dts-e-oledb-excel-not-supported-resolved.aspx?CommentPosted=true

The short answer is, for BIDS, go to the Project Properties | Debugging | Run 64 Bit Runtime | set to False.

For DTEXEC, change the properties of the Step in the SQL Agent job that is running the package. Go into the Step, pick the Execution Options tab and turn on the 32 bit mode.

Cheers


The Many Layers of Permissions Needed for Xcelsius Dashboards

In BOBJ 3.1 there are *several* places where you need to give a user permission in order for them to run an Xcelsius dashboard (.SWF) you have published. Let's go from top to bottom...

1) The .SWF file itself - The user should have something similar to "View on Demand" on the actual SWF file

2) The Webi application - Often overlooked, users need rights to run Webi in order to view dashboards

3) Universe - Users need access to the universe that the dashboard queries

4) QaaWS - Users need to be able to access the QaaWS that the dashboard calls

5) Connection - The connection in CMC that the universe uses must be accessible

6) Database - Make sure the user can access the underlying data. If using SSO, the user's Win AD account may need to be granted access to the database.

Check these permissions to remedy errors like:

"Session is closed"

"Cannot find service name="

"...do not have permissions to this universe"

Ryan

Broken Dependencies in your universe

A quick tip if you have run into the "Broken Dependencies" error in BusinessObjects 4.0 Information Designer. I tracked my error down to a list of values I recently added to the Data Foundation. Even though I was able to select the LOV in the business layer editor and even validate the BL, the validation would fail prior to publishing or when the BL file was closed. I think my problem is a corrupt project file. One solution might be to add the LOV to the BL instead of the DF layer. I ended up deleting the LOV altogether.

Never mind the icons - they're for BOBJ on the iPad

If you're wondering what those goofy oil well icons are all about, there is actually a BI connection. I recently put on a demo of the BusinessObjects mobile app, BO Explorer (search SAP Explorer in the app store). It is an iPad version of the Explorer query tool provided with BOBJ 4.0 and available online at bi.ondemand.com. The advantage of the iPad/iPhone version is that it incorporates the device's GPS, compass, and camera to add some extra wizzy ways to see your data.

In my case, I uploaded some publicly available information on well sites to bi.ondemand.com, including GPS coordinates (lat and long). As long as you have a "POI" column in your data with its contents formatted correctly, the app will automatically plot your rows on a Google map and allow you to drill into those records by tapping the icons on the map. My icons are the tacky oil wells you are seeing on my blog. The icons you show on the map are best specified by including a URL to the image so I had to put my versions of the icons somewhere stable and online - thanks Tumblr.

My final thoughts on the app are positive. For a first crack at mobility, the app and the service are quite reliable and capable. Be aware that this is not a hardcore GIS so you won't be able to handle really tightly packed points or plots of things other than points (lines or polygons for example). It is about the quickest way to put data on a map and have some sort of drill down capability. Nice addition, SAP.

#BOBJ #BusinessObjects #BOE #BI Ondemand #iPad #iPhone #mobile bi

