Building an Open Source Threat Database Pt. 1 - Architecting your Data
Background
I follow a ton of people who dissect malware, track malicious operator campaigns, and publish the indicators retrieved from this research. There are some really great tools for OSINT data gathering (I particularly like Automater by the TekDefense guys, which lets you take advantage of most informational sites that allow web requests). Still, for a long time I've been interested in compiling all the public threat information I read daily into my own threat database, one that would track and pull information from my favorite public sources and that I could search through during my own personal research.
The Process
First things first: in classic "break steps down into smaller steps" fashion, I've identified a good subset of the hoops I expect to be hopping through. I'm planning on breaking each one down into its own blog post. I'm not going to go into each one here, suspense building and all that. The first step, which I'm covering today, is to identify the specific data I'd like to collect. Which brings me to:
The Data
My first iteration of the data I found useful and wanted to collect resulted in a table that was half empty for most entries because, for whatever reason, I included a separate column for each indicator type. Figuring that would be an issue, I've broken the data down into this more streamlined version, which stores one indicator per row and records its type in a single indicatorType column:
------------------------------------------------------------------------------------------------------------------
| key | dateAdded | indicatorType | indicator | references | summary | source | sourceLink | keywords | category |
------------------------------------------------------------------------------------------------------------------
A breakdown of the columns:
Admin columns:
-key (added by db)
-dateAdded (timestamp)
-source (blog name)
-summary (blog post title)
-sourceLink (url of source)

*indicatorType
-md5
-sha
-filename
-IPAddr
-netRange
-url
-domain
-uri

*indicator (the indicator itself)

*references
-links to sites included, things like URLQuery and VirusTotal links

*future
-keywords (learning function/wordlist to capture relevant terms)
-category (C&C, hacked site, malware, etc.)
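To make the layout above concrete, here is a minimal sketch of how it might look as a SQLite table using Python's built-in sqlite3 module. This is only one possible take, not the project's actual code: the table name (indicators), the helper functions (open_db, add_indicator, search), and the choice of SQLite itself are all assumptions for illustration. The column names come straight from the table above, and the sample values are placeholders.

```python
import sqlite3

# Assumption: one table named "indicators"; column names follow the post.
# "key" and "references" are quoted because they are SQL keywords.
SCHEMA = """
CREATE TABLE IF NOT EXISTS indicators (
    "key"         INTEGER PRIMARY KEY AUTOINCREMENT,  -- added by db
    dateAdded     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    indicatorType TEXT NOT NULL CHECK (indicatorType IN
        ('md5', 'sha', 'filename', 'IPAddr', 'netRange',
         'url', 'domain', 'uri')),
    indicator     TEXT NOT NULL,
    "references"  TEXT,  -- e.g. URLQuery or VirusTotal links
    summary       TEXT,  -- blog post title
    source        TEXT,  -- blog name
    sourceLink    TEXT,  -- url of source
    keywords      TEXT,  -- future
    category      TEXT   -- future
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_indicator(conn, indicator_type, indicator, source, summary,
                  source_link, references=""):
    """Insert one indicator row and return the key assigned by the db."""
    cur = conn.execute(
        'INSERT INTO indicators (indicatorType, indicator, "references", '
        "summary, source, sourceLink) VALUES (?, ?, ?, ?, ?, ?)",
        (indicator_type, indicator, references, summary, source, source_link),
    )
    conn.commit()
    return cur.lastrowid


def search(conn, term):
    """Return rows whose indicator contains the given term."""
    rows = conn.execute(
        "SELECT indicatorType, indicator, source, sourceLink "
        "FROM indicators WHERE indicator LIKE ?",
        (f"%{term}%",),
    )
    return rows.fetchall()


# Placeholder data: the MD5 of an empty file and an example.com link.
conn = open_db()
add_indicator(conn, "md5", "d41d8cd98f00b204e9800998ecf8427e",
              "Example Blog", "Example post title",
              "https://blog.example.com/post")
print(search(conn, "d41d8"))
```

Because every row carries its own indicatorType, a hash and an IP address live in the same two columns and nothing sits half empty. The CHECK constraint is optional; it just rejects types that aren't in the list above.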
Why do this when you can Google Search?
The local db and its collection script have the advantage of collecting info from anything with an RSS feed, including internal corporate wikis, private groups, etc. Additionally, my OSTDB will be able to do targeted collection, so any hits returned by a query should be relevant. There's no digging through results or verifying sources at query time, since the sources have already been vetted when they were added.
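As a rough sketch of what collecting from an RSS feed could look like, the snippet below parses an RSS 2.0 document with Python's standard library and pulls MD5 hashes and IPv4 addresses out of each item's description. Everything here is an assumption for illustration: the function names (extract_indicators, rows_from_rss), the two simple regexes, and the sample feed are made up, and a real collector would need to fetch the feed and handle the other indicator types too.

```python
import re
import xml.etree.ElementTree as ET

# Simple patterns for two of the indicator types; deliberately not exhaustive.
MD5_RE = re.compile(r"\b[a-fA-F0-9]{32}\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_indicators(text):
    """Return (indicatorType, indicator) pairs found in a block of text."""
    found = [("md5", m.lower()) for m in MD5_RE.findall(text)]
    for ip in IPV4_RE.findall(text):
        # Discard dotted numbers that cannot be valid IPv4 octets.
        if all(int(part) <= 255 for part in ip.split(".")):
            found.append(("IPAddr", ip))
    return found


def rows_from_rss(rss_xml):
    """Turn an RSS 2.0 document (as a string) into one dict per indicator."""
    channel = ET.fromstring(rss_xml).find("channel")
    source = channel.findtext("title", default="")  # blog name
    rows = []
    for item in channel.findall("item"):
        summary = item.findtext("title", default="")    # blog post title
        source_link = item.findtext("link", default="")  # url of source
        body = item.findtext("description", default="")
        for indicator_type, indicator in extract_indicators(body):
            rows.append({
                "indicatorType": indicator_type,
                "indicator": indicator,
                "summary": summary,
                "source": source,
                "sourceLink": source_link,
            })
    return rows


# Made-up feed using a placeholder hash and a documentation-range IP.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example Malware Blog</title>
<item>
<title>Dropper analysis</title>
<link>https://blog.example.com/dropper</link>
<description>Sample d41d8cd98f00b204e9800998ecf8427e beacons to 192.0.2.10.</description>
</item>
</channel></rss>"""

for row in rows_from_rss(SAMPLE_FEED):
    print(row["indicatorType"], row["indicator"], row["sourceLink"])
```

Since the feeds are picked by hand, each row can carry its source and sourceLink along with it, which is what lets a later query skip the verification step.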
Next Time
There is a case to be made that I should have architected the entire system before figuring out what data to collect. In truth, I've had an overall idea in my head for quite a while. I'm planning to include more information on the overall design in next week's post, as well as some alpha code for the project.






