Data Mining and Data Visualization Tools
Megan:
Term extraction is working. Now it is time for the real fun -- building an XML graph. This may take a few days to figure out (working on it in the evenings), but that will allow time for more message data to accumulate.
name:
@Guardian
The current KNIME release (V2.5.4) looks quite OK, and it is certainly far less buggy than Gephi. I use it for small load jobs and to evaluate data from different sources.
@Megan
I made the Fucilla Graph by simply scraping the data about persons and organizations by hand from the first page of the thread and saving it as CSV, then doing the same for the edges, followed by some adjustment by hand. I am looking forward to seeing what you are doing.
I do SQL, PL/SQL and data design (Postgres/Oracle), and I do not yet understand NoSQL/graph-oriented databases.
@Dant
The Gephi bugs have been reported.
Hint: For all the Java tools (Gephi, yEd, KNIME, RapidMiner), check that the Java memory settings are sufficient for what you intend to do.
Look for the conf file or startup script of the tool and set -Xmx to something sensible. From experience I set it to half the physical memory, so with 4 GB and lots of data, set -Xmx2048m.
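The half-of-physical-memory rule of thumb above can be sketched in a few lines of Python. The function name `suggest_xmx` and the `fraction` parameter are made up for illustration; where the flag actually goes (conf file or startup script) depends on the tool.

```python
def suggest_xmx(physical_mb: int, fraction: float = 0.5) -> str:
    """Return a -Xmx flag sized to a fraction of physical memory (in MB)."""
    if physical_mb <= 0:
        raise ValueError("physical_mb must be positive")
    heap_mb = int(physical_mb * fraction)
    return f"-Xmx{heap_mb}m"


# A machine with 4 GB of RAM:
print(suggest_xmx(4096))  # -Xmx2048m
```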
Megan:
--- Quote from: name on May 01, 2012, 03:39:39 PM ---...@Megan
I made the Fucilla Graph by simply scraping the data about persons and organizations by hand from the first page of the thread and saving it as CSV, then doing the same for the edges, followed by some adjustment by hand. I am looking forward to seeing what you are doing.
I do SQL, PL/SQL and data design (Postgres/Oracle), and I do not yet understand NoSQL/graph-oriented databases.
...
--- End quote ---
I am using MS SQL Server 2012 to build a database of nodes and connections. An SSIS package runs once an hour and queries the latest forum posts using a script component written in C#, and then adds any new or changed messages to the database.
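For readers without SSIS, the "add any new or changed messages" step might look roughly like the Python sketch below. It is not the C# script component described above. The `message` table layout, the `sync_messages` name and the use of SQLite are assumptions made only to keep the example self-contained.

```python
import sqlite3


def sync_messages(conn, fetched):
    """Insert new messages and update changed ones; return how many rows were written.

    `fetched` is an iterable of (message_id, body) pairs, e.g. from a forum feed.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS message (id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
    )
    written = 0
    for msg_id, body in fetched:
        row = conn.execute(
            "SELECT body FROM message WHERE id = ?", (msg_id,)
        ).fetchone()
        if row is None:
            # Message not seen before: add it.
            conn.execute("INSERT INTO message (id, body) VALUES (?, ?)", (msg_id, body))
            written += 1
        elif row[0] != body:
            # Message was edited on the forum: refresh the stored body.
            conn.execute("UPDATE message SET body = ? WHERE id = ?", (body, msg_id))
            written += 1
    conn.commit()
    return written
```

Returning the count of written rows makes it easy for a later step to decide whether there is anything new to run term extraction on.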
A second SSIS package runs on demand (it could be scheduled to run every day or so) and applies the Term Extraction component to the message body of each new or changed message. This information is used to build a "term" table containing significant nouns and noun phrases for each message. This represents the graph that I am trying to build, and the message table (maintained by the first SSIS package) represents attributes associated with each node.
After updating the term table the package runs one additional Term Extraction across the entire message table, calculating a TF-IDF score for each noun or noun phrase and storing the values in the term table. I don't know if this will be useful or not, but it was easy to do and is easy to remove.
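As a rough illustration of the scoring step, here is one common TF-IDF formulation in Python: a term's total frequency multiplied by the log of (number of messages / number of messages containing the term). Whether the SSIS component computes exactly this is an assumption, and the function name and input shape are invented for the example.

```python
import math
from collections import Counter


def tfidf_scores(messages):
    """Score terms across a corpus.

    `messages` maps a message id to the list of terms extracted from it.
    Score = total frequency * log(message count / messages containing the term).
    """
    n = len(messages)
    total = Counter()     # how often each term occurs overall
    doc_freq = Counter()  # in how many messages each term occurs
    for terms in messages.values():
        total.update(terms)
        doc_freq.update(set(terms))
    return {term: total[term] * math.log(n / doc_freq[term]) for term in total}


scores = tfidf_scores({
    1: ["graph", "forum"],
    2: ["graph"],
    3: ["graph", "xml", "xml"],
})
# "graph" appears in every message, so its score is 0; "xml" scores highest.
```

A term that shows up in every message scores zero under this formula, which is one way ordinary words could be pushed to the bottom of the term table.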
The next step is to query the graph out as XML, perhaps in GEXF format or in a format that can transform to GEXF or other formats that Gephi or other tools recognize.
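A minimal sketch of what a GEXF export could look like, written in Python rather than as a SQL Server XML query. It assumes the GEXF 1.2 draft layout (a `gexf` root, one `graph`, then `nodes` and `edges`); the function name and the sample ids are hypothetical, and the output should be checked against the GEXF version your Gephi build expects.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def build_gexf(nodes, edges):
    """Serialize (id, label) nodes and (source, target) edges as a GEXF string."""
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        root, "graph", {"mode": "static", "defaultedgetype": "undirected"}
    )
    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in nodes:
        ET.SubElement(nodes_el, "node", {"id": str(node_id), "label": label})
    edges_el = ET.SubElement(graph, "edges")
    for index, (source, target) in enumerate(edges):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(index), "source": str(source), "target": str(target)},
        )
    return ET.tostring(root, encoding="unicode")


# One message node linked to one term node:
gexf_text = build_gexf([("m1", "message 1"), ("t1", "graph")], [("m1", "t1")])
```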
The code that extracts messages from the forum needs quite a bit of work. Right now the message bodies are being stored as HTML, and I provide an exception list to the Term Extraction component that causes it to remove HTML tag and style names. This part could be done a lot better.
foofighter:
(Disclaimer: I work for Neo)
For the types of data and queries mentioned here I think the graph data model is, in general, the ideal solution, and Neo4j would be a good fit technology-wise for the storage. I've been considering doing a forum scrape project myself, whereby one would first manually make a list of people to be tracked (like Fucilla et al.), and the extractor would then simply note which of those people are mentioned in the same posts, as a way to infer relationships. This would not give high quality data, but would probably produce lots of it, given the amount of forum posts. One option is to use this as a first step to produce suggested relationships that are then manually confirmed.
When it comes to visualization I know my coworkers and the community have played with a bunch of really nifty JavaScript frameworks. I don't have any experience with them myself, but could easily find out what works and how to do it, if necessary.
I should also say that the new query language for Neo opens some interesting possibilities. Asking things like "Is Mr. X related to Mr. Y, and through whom?" is a one-liner, literally.
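The co-mention idea can be sketched in a few lines of Python. This is only an illustration: the names in the example are placeholders, the function name is invented, and plain case-insensitive substring matching is a deliberately naive assumption that will produce false hits on short names.

```python
from collections import Counter
from itertools import combinations


def co_mentions(posts, tracked):
    """Count, for each pair of tracked names, the posts that mention both."""
    edges = Counter()
    for text in posts:
        lowered = text.lower()
        present = sorted(name for name in tracked if name.lower() in lowered)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


suggested = co_mentions(
    ["Alice met Bob last year", "Bob, Carol and Alice again", "Carol alone"],
    ["Alice", "Bob", "Carol"],
)
# suggested[("Alice", "Bob")] == 2
```

The counts can serve as edge weights, so pairs that are mentioned together most often come up first for manual confirmation.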
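To show what such a question asks of the data, here is a plain-Python breadth-first search over an undirected edge list. It is not Neo4j's query language and says nothing about how Neo4j answers the query internally; the function name and the sample edges are made up for the example.

```python
from collections import deque


def connection_path(edges, start, goal):
    """Return a shortest chain of people linking start to goal, or None."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    previous = {start: None}  # how each visited person was reached
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in neighbours.get(current, ()):
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    return None


print(connection_path([("X", "A"), ("A", "Y"), ("B", "C")], "X", "Y"))
# ['X', 'A', 'Y']
```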
To me the big problem seems to be finding a good way to get the data in. A Semantic Wiki is one option, which then could be fed to Neo4j or any other visualization tools.
Megan:
--- Quote from: foofighter on May 02, 2012, 04:54:14 AM ---...To me the big problem seems to be finding out a good way to get the data in. A Semantic Wiki is one option, which then could be fed to Neo4j or any other visualization tools.
--- End quote ---
No kidding. I have been working on cleaning the body text by removing HTML tags. All of SMF's XML formats (XML, Atom, RSS) return the message body as HTML. I am trying to treat the HTML as XML and extract just the text and CDATA portions. It doesn't matter what it looks like because it is only seen by the term extraction component. Once I solve that problem I will go back to producing a standard XML graph output format.
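One way to do the tag stripping, sketched in Python with the standard library's HTML parser instead of treating the body as XML (forum HTML is often not well-formed XML, though that is an assumption about SMF's output). The class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and the contents of script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(body):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(body)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


print(html_to_text("<p>Term <b>extraction</b> is working.</p>"))
# Term extraction is working.
```

Text fragments are joined with a space, so a word split by inline markup would come out as two words. That seems acceptable when the only consumer is the term extraction step.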
The term table is looking good, if a bit dull. The top words are rather ordinary, and I am adding some of them to the exclusion table. I am trying to capture all forum messages, not just selected ones. It should be possible to filter the result to remove unwanted topics. The more message attributes I can make available, the more filtering options there will be.