
Data Mining and Data Visualization Tools


Megan:
I have created a 4-level graph (board/topic/message/term), but I haven't figured out what to do with it yet within Gephi, other than verify that it contains the data I expect it to contain. It can be downloaded here.

name:

--- Quote from: Megan on May 06, 2012, 05:57:18 AM ---I have created a 4-level graph (board/topic/message/term), but I haven't figured out what to do with it yet within Gephi, other than verify that it contains the data I expect it to contain. It can be downloaded here.

--- End quote ---
What I did with it after downloading:
After loading it, I let Gephi calculate the statistics shown in the right-side panel. Then I used the "Yifan Hu Proportional" layout and set the "optimal distance" parameter above 250 so the nodes don't stick too close to each other. Some nodes clustered in the center and others remained on the periphery.
I then ranked the nodes by "Authority" and let it apply color. I also ranked the nodes by "Hub" score and let it resize the nodes.
If I understand correctly what I see, the most important threads of the forum are "Diet and Health", "Whats on your mind" and "Suggest an article for SOTT", and the outsider among the threads is "Cryptozoology".
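
For readers unfamiliar with the "Authority" and "Hub" rankings used above: they come from the HITS algorithm. The following is a minimal sketch of the idea in plain Python, not Gephi's own implementation, which may normalize and stop differently. The node names and edges are made up for illustration and are not taken from Megan's graph file.

```python
from math import sqrt

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of (source, target) edges."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a node: sum of the hub scores of nodes pointing at it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score of a node: sum of the authority scores of nodes it points at.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Hypothetical toy graph: three messages pointing at two topics.
toy_edges = [
    ("msg1", "topicA"), ("msg2", "topicA"), ("msg3", "topicA"),
    ("msg3", "topicB"),
]
hub, auth = hits(toy_edges)
for node in sorted(auth, key=auth.get, reverse=True):
    print(f"{node:8s} authority={auth[node]:.3f} hub={hub[node]:.3f}")
```

In this toy graph the topic that more messages point at ends up with the higher authority, which is roughly why heavily linked boards would be colored as the most important ones.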

Megan:
Well that is encouraging! I am working on a new version now that relates the terms in a given message to the predecessor messages instead of the individual terms within those messages. It seems like that is really what is of interest -- "which prior messages contain each term in this message?" The only reason I made the terms their own (4th) level was the Gephi "parallel edge" restriction. I have also given each node level a different color (in the upcoming version of the data).
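
One possible reading of "which prior messages contain each term in this message?" can be sketched as follows. This is only an illustration under assumed inputs: the function name, the message IDs, and the term sets are hypothetical, and the real extraction runs against the relational database rather than an in-memory list.

```python
from collections import Counter

def predecessor_edges(messages):
    """messages: list of (message_id, set_of_terms) in posting order.

    Returns (message_id, earlier_message_id, term) triples, one for every
    earlier message that already contains a term of the current message.
    """
    seen = {}    # term -> list of earlier message ids that contain it
    edges = []
    for msg_id, terms in messages:
        for term in sorted(terms):
            for earlier in seen.get(term, []):
                edges.append((msg_id, earlier, term))
        # Register this message's terms only after linking, so a message
        # never links to itself.
        for term in terms:
            seen.setdefault(term, []).append(msg_id)
    return edges

# Hypothetical sample data.
toy = [
    ("m1", {"gephi", "graph"}),
    ("m2", {"graph", "neo4j"}),
    ("m3", {"gephi", "neo4j", "graph"}),
]
edges = predecessor_edges(toy)

# Two messages can share several terms, which would give parallel edges.
# One way around that is to collapse them into a single weighted edge.
weights = Counter((src, dst) for src, dst, _term in edges)
```

Collapsing to weights loses the term labels, so keeping the terms as their own node level, as described above, remains the option that preserves them.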

At the moment, now that I see that there might be something useful coming out of this tool, I am cleaning up the structure of the relational database, which I created quite hastily. I will generate a new graph when everything is running again.

At some point I will want to create something that can load an entire topic into the database -- like the "Handl" topic. I don't know yet when I will be able to work on that.

It would be useful to have a tool that can explore the graph, asking specific questions about terms of interest. It has to be interactive, because as you see what connects with what directly, you may notice other related terms of interest (or different spellings of the same term) in a given node. I have tried filtering using Gephi but haven't had any success so far. Perhaps Neo4j is more suited to that sort of querying?
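
As a rough idea of what such an exploration step could look like outside Gephi or Neo4j, here is a small Python sketch. It assumes the data is available as a mapping from message ID to a set of terms; the function names and sample data are hypothetical. Spelling variants are guessed with the standard library's `difflib.get_close_matches`, which is a crude similarity measure and will miss or over-match some cases.

```python
from collections import Counter, defaultdict
from difflib import get_close_matches

def build_index(messages):
    """Invert {message_id: terms} into {term: set of message_ids}."""
    index = defaultdict(set)
    for msg_id, terms in messages.items():
        for term in terms:
            index[term].add(msg_id)
    return index

def explore(term, messages, index):
    """Return messages containing term, co-occurring terms, and near spellings."""
    hits = sorted(index.get(term, ()))
    related = Counter()
    for msg_id in hits:
        related.update(t for t in messages[msg_id] if t != term)
    candidates = [t for t in index if t != term]
    variants = get_close_matches(term, candidates, n=5, cutoff=0.75)
    return hits, related, variants

# Hypothetical sample data; "gehpi" stands in for a misspelling.
messages = {
    "m1": {"gephi", "layout"},
    "m2": {"gephi", "neo4j"},
    "m3": {"neo4j", "gehpi"},
}
index = build_index(messages)
hits, related, variants = explore("gephi", messages, index)
print(hits, dict(related), variants)
```

Calling `explore` again on any term that turns up in `related` or `variants` gives the "follow what connects to what" loop described above.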

As usually happens when I create a database, I am finding that I have my hands full just collecting and generating the data. That is why I am not more familiar with advanced visualization tools -- I never reach that stage! If several of you can look at the data I am uploading and provide feedback on how I can improve it, I will be more than happy to focus on that end of things.

Megan:
OK, all cleaned up. The latest version of the graph can be found here.

I finished normalizing the database, creating a proper set of entities along with a staging table, and I improved the extraction process to hopefully eliminate any invalid node IDs in the graph file.

I am trying to decide whether to keep the message bodies in the database. They are in the staging table, but I am not currently including them in the 'permanent' tables because of their size. I am thinking of creating some sort of archive process that will preserve them offline.

Megan:
I've been busy with other things for the past month, but I am trying to take some time this holiday weekend to complete the part of the data collector that will download an entire topic. I would like to see if I can build useful graphs of topics of interest using this tool.

Ideally I would download entire public boards, but I don't want to create too much of a load on the Forum, let alone my home network, so that will necessarily be a slow process. I will be happy for the moment if I can download a single topic.

I will be extending the existing code, which recognizes the SMF proprietary XML "news" format, to also parse the XHTML returned when a user accesses the website. They are largely the same in principle but very different in the details. The "news" format is perfect for incremental updating, but cannot be used for bulk downloads because it can only return the most recent posts of a specified target.

It should not be hard to make the change, as long as the returned XHTML is valid. I think it will be.
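
The dependence on valid XHTML can be illustrated with a short sketch using Python's standard XML parser. The markup below is invented for the example: the `div class="post"` wrapper, the `h5` author element, and the `msg_…` IDs are assumptions and not the actual SMF page structure, and this is not the collector's real code.

```python
import xml.etree.ElementTree as ET

XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical page fragment; real forum markup will differ.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="post" id="msg_101"><h5>Megan</h5><p>First message.</p></div>
    <div class="post" id="msg_102"><h5>name</h5><p>A <b>reply</b> here.</p></div>
  </body>
</html>"""

def extract_posts(xhtml_text):
    """Parse well-formed XHTML and return one dict per post element."""
    # Raises ET.ParseError if the input is not well-formed XML.
    root = ET.fromstring(xhtml_text)
    posts = []
    for div in root.iterfind(".//x:div[@class='post']", XHTML_NS):
        author = div.findtext("x:h5", default="", namespaces=XHTML_NS)
        body_el = div.find("x:p", XHTML_NS)
        # itertext() flattens nested inline markup such as <b>.
        body = "".join(body_el.itertext()) if body_el is not None else ""
        posts.append({"id": div.get("id"), "author": author, "body": body})
    return posts

posts = extract_posts(SAMPLE)
for post in posts:
    print(post["id"], post["author"], post["body"])
```

A strict XML parser like this one rejects pages with unclosed tags, and it also fails on HTML entities such as `&nbsp;` unless they are defined, so a real page may need some cleanup before it parses.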
