Data Mining and Data Visualization Tools

I have created a 4-level graph (board/topic/message/term), but I haven't figured out what to do with it yet within Gephi, other than verify that it contains the data I expect it to contain. It can be downloaded here.
 
Megan said:
I have created a 4-level graph (board/topic/message/term), but I haven't figured out what to do with it yet within Gephi, other than verify that it contains the data I expect it to contain. It can be downloaded here.
What I did with it after downloading:
After loading it, I let Gephi calculate the statistics shown in the right-side panel. Then I used the "Yifan Hu Proportional" layout and set the "optimal distance" parameter above 250 so the nodes don't stick too closely to each other. Some nodes clustered in the center while others remained on the periphery.
I then ranked the nodes by "Authority" and let Gephi apply color, and I also ranked them by "Hub" degree and let it resize the nodes.
If I understand what I see correctly, the most important threads of the forum are "Diet and Health", "Whats on your mind", and "Suggest an article for SOTT", and the outsider among the threads is "Cryptozoology".
 
Well that is encouraging! I am working on a new version now that relates the terms in a given message to the predecessor messages instead of the individual terms within those messages. It seems like that is really what is of interest -- "which prior messages contain each term in this message?" The only reason I made the terms their own (4th) level was the Gephi "parallel edge" restriction. I have also given each node level a different color (in the upcoming version of the data).
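In case anyone wants to reproduce the structure, here is a minimal sketch (Python with the networkx library, not the actual collector code; all IDs and labels are made up) of how giving terms their own node level sidesteps Gephi's parallel-edge restriction: two messages that share a term are linked through the shared term node rather than by multiple direct edges between the same pair of nodes.

```python
# Minimal sketch (not the actual collector code): model terms as their own
# node level so Gephi never needs parallel edges between two messages.
# Requires the networkx package; node IDs/labels here are invented.
import networkx as nx

G = nx.DiGraph()  # one edge per (source, target) pair -- no parallel edges

# Hypothetical IDs: an alpha prefix distinguishes node types.
G.add_node("b1", label="Diet and Health", level="board")
G.add_node("t10", label="Life Without Bread", level="topic")
G.add_node("m100", label="Reply #1", level="message")
G.add_node("m101", label="Reply #2", level="message")
G.add_node("w1", label="ketosis", level="term")

G.add_edge("b1", "t10")    # board contains topic
G.add_edge("t10", "m100")  # topic contains messages
G.add_edge("t10", "m101")
G.add_edge("m100", "w1")   # both messages mention the same term;
G.add_edge("m101", "w1")   # the shared term node links them indirectly

nx.write_gexf(G, "forum.gexf")  # Gephi reads GEXF directly
```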

Now that I can see something useful might come out of this tool, I am cleaning up the structure of the relational database, which I created quite hastily. I will generate a new graph when everything is running again.

At some point I will want to create something that can load an entire topic into the database -- like the "Handl" topic. I don't know yet when I will be able to work on that.

It would be useful to have a tool that can explore the graph, asking specific questions about terms of interest. It has to be interactive, because as you see what connects with what directly, you may notice other related terms of interest (or different spellings of the same term) in a given node. I have tried filtering using Gephi but haven't had any success so far. Perhaps Neo4j is more suited to that sort of querying?
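For what it's worth, if the graph were loaded into Neo4j, the kind of interactive question I mean could look like the sketch below. This is hypothetical: the Term/Message labels and the MENTIONS relationship are illustrative names, not part of the current data, and the connection details are placeholders.

```python
# Hypothetical sketch: interactive term exploration if the graph lived in
# Neo4j. Labels and relationship names (Term, Message, MENTIONS) are
# illustrative, and the URI/credentials are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

QUERY = """
MATCH (m:Message)-[:MENTIONS]->(t:Term {name: $term})
RETURN m.subject AS subject, m.posted AS posted
ORDER BY m.posted
"""

# Ask "which prior messages contain this term?" for any term of interest.
with driver.session() as session:
    for record in session.run(QUERY, term="ketosis"):
        print(record["subject"], record["posted"])

driver.close()
```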

As usually happens when I create a database, I am finding that I have my hands full just collecting and generating the data. That is why I am not more familiar with advanced visualization tools -- I never reach that stage! If several of you can look at the data I am uploading and provide feedback on how I can improve it, I will be more than happy to focus on that end of things.
 
OK, all cleaned up. The latest version of the graph can be found here.

I finished normalizing the database, creating a proper set of entities along with a staging table, and I improved the extraction process to hopefully eliminate any invalid node IDs in the graph file.

I am trying to decide whether to keep the message bodies in the database. They are in the staging table, but I am not currently including them in the 'permanent' tables because of their size. I am thinking of creating some sort of archive process that will preserve them offline.
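Just to make the idea concrete, an archive step could be as simple as the following sketch. The staging_messages table and its columns are invented names for illustration, with SQLite standing in for the real database.

```python
# Sketch of an offline archive step for message bodies. The table and
# column names (staging_messages, message_id, body) are assumptions made
# for this example, as is the use of SQLite.
import csv
import gzip
import sqlite3

conn = sqlite3.connect("forum.db")

# Dump the bulky bodies to a compressed file for offline preservation.
with gzip.open("message_bodies.csv.gz", "wt", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["message_id", "body"])
    for row in conn.execute("SELECT message_id, body FROM staging_messages"):
        writer.writerow(row)

# Once safely archived, the bodies can be cleared from staging.
conn.execute("UPDATE staging_messages SET body = NULL")
conn.commit()
conn.close()
```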
 
I've been busy with other things for the past month, but I am trying to take some time this holiday weekend to complete the part of the data collector that will download an entire topic. I would like to see if I can build useful graphs of topics of interest using this tool.

Ideally I would download entire public boards, but I don't want to create too much of a load on the Forum, let alone my home network, so that will necessarily be a slow process. I will be happy for the moment if I can download a single topic.

I will be extending the existing code, which recognizes the SMF proprietary XML "news" format, to also parse the XHTML returned when a user accesses the website. They are largely the same in principle but very different in the details. The "news" format is perfect for incremental updating, but cannot be used for bulk downloads because it can only return the most recent posts of a specified target.

It should not be hard to make the change, as long as the returned XHTML is valid. I think it will be.
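For the incremental side, the news-feed pull is simple enough to sketch. This is an illustration, not the actual collector: the URL follows SMF's .xml action, but the host and parameter values are examples, and no assumptions are made about the feed's element names.

```python
# Sketch of incremental collection from SMF's XML "news" feed. The URL
# pattern follows SMF's .xml action; the host and parameter values are
# examples only. As noted above, this feed only returns the most recent
# posts, so it suits incremental updates but not bulk downloads.
import urllib.request
import xml.etree.ElementTree as ET

FEED = "https://example.org/forum/index.php?action=.xml;type=smf;sa=recent;limit=10"

with urllib.request.urlopen(FEED) as response:
    tree = ET.parse(response)

# Print each post's fields generically, without assuming specific tag names.
for post in tree.getroot():
    print({child.tag: (child.text or "").strip() for child in post})
```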
 
Megan said:
I've been busy with other things for the past month, but I am trying to take some time this holiday weekend to complete the part of the data collector that will download an entire topic. I would like to see if I can build useful graphs of topics of interest using this tool.

Good job, Megan, on the initiative and progress.
Initially I was excited about this project, but I soon realized that I lacked the proper skill set to go further, and I couldn't find the time either.
Here are some thoughts from trying to visualize how this could work.
The main thing in any data mining project is structured data: the ability to accumulate data in a consistent pattern, link it, and present it. Judging from this thread, there is plenty of skill here to do it, and I tried to fit the work into a general data warehouse (DW) model (while violating some basic DW principles for simplicity).
Though I drew separate boxes in the traditional DW sense, a single program can accomplish it end to end (as Megan already did).
[Attachment: arch.jpg (architecture sketch)]

Let's frame the purpose as: the ability to connect different entities (people, events, organizations, products) based on the individual entities entered, then link and present them.
For this we need:
• A consistent architecture (DB, tools) for reusability. The DB could be a simple MySQL (or Neo4j) instance hosted in a central location, plus extraction tools and presentation tools like Gephi, yEd, etc.
• A structured, reusable data structure, i.e., a data model.
• Consistent ID generation, consistent across whatever means (technology) is used. Having consistent tools helps with consistent ID generation.
• The ability to clean the data and merge IDs, either programmatically or through some custom or off-the-shelf screen. Here I see a need for a web screen, based on the data model, that can display the DB content and update the IDs, etc. Alternatively, this could be done programmatically, or we could use the MySQL tools themselves, or DML statements, to update the data.
There must be some ready-made web screens available on the net, but I couldn't find any in a quick search.

• The ability to establish relationships between people, organizations, events, and products, either using some custom screen (manually) or using queries. Here I see a need for a web screen that displays DB table data and offers lists of entities, from which we can select different IDs to establish a relationship. Alternatively, build the relationships programmatically, or build them manually in Excel and upload.

• The presentation part can be achieved by many means: DB views, XML, or any other format that tools like Gephi and yEd accept.

• A web screen with a linking feature like LinkedIn's (given a company name, it shows the people at that company who are connected to our profile through intermediaries; not an easy task, but worth it). This type of feature would be great, but data quality becomes very important, or it will be garbage in, garbage out. If we can organize the data hierarchically, a simple SQL query will do the trick; see the sketch after this list.
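
To illustrate that last point, here is a minimal sketch of such a connection query. It is illustrative only: the person, org, and person_org tables are invented for this example, SQLite stands in for whatever central DB is chosen, and a recursive CTE would extend the single join shown here to longer chains of intermediaries.

```python
# Illustrative only: table names (person, org, person_org) are invented
# for this sketch, and SQLite stands in for the central DB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE org    (org_id    INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE person_org (person_id INTEGER, org_id INTEGER);

INSERT INTO person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
INSERT INTO org    VALUES (10, 'Acme');
INSERT INTO person_org VALUES (1, 10), (2, 10), (3, 10);
""")

-- = "Who is connected to person 1 through a shared organization?"
rows = conn.execute("""
SELECT DISTINCT p2.name
FROM person_org me
JOIN person_org them ON them.org_id = me.org_id
JOIN person p2       ON p2.person_id = them.person_id
WHERE me.person_id = 1 AND p2.person_id <> 1
""").fetchall()
print(rows)  # [('Bob',), ('Carol',)]
```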

Though I feel web screens are useful for letting non-technical people clean the data, some of this can be done with MySQL itself.

The structured data is represented by the data model. I took a shot at the model; this is a very rough one, and I violated some standard rules of normalization to avoid additional work. (Ignore the relationships, 1:N, etc.)
Note: we can add a lot of additional information, like source details (node ID, board ID, etc.) and graph formatting info (node width, length, etc.).
[Attachment: DM_CTD.jpg (draft data model)]


I can help with the data modeling aspect and other activities.
 
Seek10,

The collector that I have been working on is meant as a proof of concept and, frankly, just a way to start collecting data on what is in the forum. From the beginning I was interested in tying forum data to other sources, but it is too much for me to take on right now. Any modeling work you can do would be helpful. Higher-level logical models are fine.

My day job includes maintaining and expanding a small Kimball-type data warehouse of information useful to the non-profit medical association that I work for. I started it about three years ago, and I have been working with this architecture on and off for over 12 years. I use SQL Server, first because the organizations I have worked for prefer it, but more importantly because it provides high value relative to its cost. The cost is low for a qualifying non-profit that can obtain licenses through TechSoup. Development can be done using SQL Server Developer Edition, which can be had on Amazon for $50 and contains no functional restrictions.

I have not tried to build a true warehouse with what I have done so far. It is just an SMF adapter that lets me collect publicly visible forum data without requiring a direct database connection, along with code to write a GEXF graph file. It is closely tied to SMF and to the "Cassiopaea Morning" template, and I am using SMF's keys rather than surrogate keys. It could, however, feed into a data warehouse.
 
Megan said:
Seek10,

The collector that I have been working on is meant as a proof of concept and, frankly, just a way to start collecting data on what is in the forum. From the beginning I was interested in tying forum data to other sources, but it is too much for me to take on right now. Any modeling work you can do would be helpful. Higher-level logical models are fine.

In that case, we need to try different things to create decent-quality data, and based on that we can modify the model. I suspect we will have to do some data cleansing manually. I thought MySQL might be a good choice since it can be installed on a website, so more people can see the data and give their input. When you have a good amount of data, I can take a look at it and modify the model to fit the presentation tools' needs.
 
The graph creation process uses an exclusion list to eliminate "uninteresting" words (noun phrases), but I have no good method right now for managing that list. It could be done interactively using PHP/MySQL, although my own web development experience is with ASP.NET/SQL Server.
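To make the idea concrete, here is a sketch of keeping the list in a table so that any front end, PHP or ASP.NET, could edit it, with the graph builder simply filtering against it. The table and column names are invented for the example.

```python
# Sketch: keep the exclusion list in a database table so any front end can
# manage it. Names (exclusion_list, term) are invented for this example,
# and SQLite stands in for the real database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exclusion_list (term TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO exclusion_list VALUES (?)",
                 [("thing",), ("way",), ("time",)])

def interesting_terms(candidates):
    """Drop any candidate noun phrase that appears in the exclusion list."""
    excluded = {row[0] for row in conn.execute("SELECT term FROM exclusion_list")}
    return [t for t in candidates if t.lower() not in excluded]

print(interesting_terms(["ketosis", "time", "Handl"]))  # ['ketosis', 'Handl']
```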

SQL Server 2012 includes Data Quality Services for rule-based automated cleansing but, as you say, some things would have to be done manually.

My personal website is Apache/PHP/MySQL-based, because I run things like WordPress, Joomla, and SMF. If I developed any interactive tools myself, I would be inclined to add a second, Windows-based site from my current hosting service. It's not particularly expensive.
 
Megan said:
The graph creation process uses an exclusion list to eliminate "uninteresting" words (noun phrases), but I have no good method right now for managing that list. It could be done interactively using PHP/MySQL, although my own web development experience is with ASP.NET/SQL Server.

SQL Server 2012 includes Data Quality Services for rule-based automated cleansing but, as you say, some things would have to be done manually.

My personal website is Apache/PHP/MySQL-based, because I run things like WordPress, Joomla, and SMF. If I developed any interactive tools myself, I would be inclined to add a second, Windows-based site from my current hosting service. It's not particularly expensive.

You can develop with whatever tools and DB you want for now, but if tomorrow more people want to load data, the IDs will go out of sync or be duplicated unless we use a centralized DB's ID mechanism. Since you are the only person parsing the data, you can build it in SQL Server; if we later go with the plan of an Internet-based tool and DB, we can easily port it as long as the data model stays the same. If more people want to parse the data, we can share the same data in files and import it into individual DBs, since very few people will be working on it.

Personally, I haven't used SQL Server, but I can get it and take a look. I have mostly used Oracle and DB2.
 
I started out to create a more general-purpose collector, but I quickly noticed that it was going to be a lot of work and that I needed to focus on SMF and the Cass Forum for now. The only usable keys that will keep working if the internal IDs change (when moving to new forum software, for example) are the reply/topic/board/user names/subjects and the actual terms that have been identified within the messages. If the internal IDs change, it breaks URLs pointing into the forum, because URLs to anything deeper than the forum home page contain those IDs.

The terms, at least, are common keys that can connect forum posts with other sources.

I am using SQL Server because I know it well and don't need to spend time coming up to speed on how to use it, and finding time to work on this can be a challenge. SQL Server also provides me with all the tools I need in one place. It is relatively inexpensive, and using other "free" tools at this point would cost me more time than I have to spare. The data could, however, move to another platform later on.

I am testing the new code with various topics on the forum. So far I have imported the "Handl" topic and "Life Without Bread." When I tried the "Smoking" topic, though, it choked (pun intended). I am working on that problem now. Once I have topic importing working, I can look at doing whole boards. At some point I will produce a new GEXF file from what I have collected. Once the basic database is built, I can start to think about more ways to use the data.
 
If I understand correctly, these IDs are for the source (reply/topic/board/user names/subjects) of the data. We may also have to come up with IDs for persons, organizations, etc. to avoid duplicate persons and organizations (because you need unique IDs for Gephi to plot). One way to create IDs is to maintain a lookup table each for persons and organizations: check whether somebody already exists in the lookup before creating a new ID, updating the lookup as you go. Unfortunately, a lookup based on first and last name is very error-prone (due to duplicate names), and this will be a hurdle (or we can start with this and see how it goes). Once we cross that hurdle, a lot of other things become straightforward. This is where master data management companies mint millions.
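A bare-bones version of that lookup might look like this sketch. The person table, SQLite, and matching on the raw name are all simplifying assumptions, and the name-only match is exactly the error-prone part mentioned above.

```python
# Sketch of a get-or-create ID lookup for persons. Matching on the bare
# name is the error-prone simplification discussed above; the table name
# and SQLite are assumptions for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

def person_id(name: str) -> int:
    """Return the existing ID for a name, or create one if it is new."""
    row = conn.execute("SELECT person_id FROM person WHERE name = ?",
                       (name,)).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO person (name) VALUES (?)", (name,))
    return cur.lastrowid

print(person_id("Jane Doe"))  # 1 (created)
print(person_id("Jane Doe"))  # 1 (found again, no duplicate)
```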


By the way, I ordered the SQL Server 2012 Developer edition in the hope of using it.
 
The SMF IDs are the best I can do for now. I added an alpha prefix to distinguish one type of ID from another (message/term/user/etc.). Surrogate keys can be added easily enough later on. I am still stuck on why the import chokes on the Smoking topic. There is an error due to some unidentified HTML entity in the incoming page, which throws an exception, and things don't work right after that in spite of the error handling I have in place.

SQL Server can represent a major learning curve, but I could send you a backup of my database and a copy of my SSIS packages and you should be able to import and export.
 
I found out why I couldn't parse the "Smoking" thread: SMF is emitting invalid HTML. There is an ampersand in some of the reply subject lines that is sent as a bare "&" instead of "&amp;". I am going to have to add a front-end "cleaner" before I can read that.

The ampersands are probably coming out of the database. Some earlier version of SMF failed to encode them, perhaps, and now that they are stored that way, they are sent back out incorrectly.
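A pre-parse "cleaner" along these lines is one way to handle it (a sketch of the general technique, not the actual code): escape any ampersand that does not already begin a valid-looking entity.

```python
# Sketch of a pre-parse "cleaner" for bare ampersands in otherwise valid
# XHTML. The regex leaves real entities (&amp; &quot; &#8217; ...) alone.
import re

BARE_AMP = re.compile(r"&(?![A-Za-z][A-Za-z0-9]*;|#\d+;|#x[0-9A-Fa-f]+;)")

def clean(xhtml: str) -> str:
    """Escape ampersands that are not already part of an entity."""
    return BARE_AMP.sub("&amp;", xhtml)

print(clean("Diet & Health"))      # Diet &amp; Health
print(clean("Q&amp;A stays put"))  # Q&amp;A stays put
```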
 
Megan said:
I found out why I couldn't parse the "Smoking" thread: SMF is emitting invalid HTML. There is an ampersand in some of the reply subject lines that is sent as a bare "&" instead of "&amp;". I am going to have to add a front-end "cleaner" before I can read that.

The ampersands are probably coming out of the database. Some earlier version of SMF failed to encode them, perhaps, and now that they are stored that way, they are sent back out incorrectly.

Yes, this is a common problem when parsing data and integrating it across different independent tools, as some tools have their own reserved characters that can be misinterpreted. Recently we worked on a project where '&' went into an XML file as data and got misinterpreted. We ended up having to hunt down the reserved characters.
 
