Data Mining and Data Visualization Tools

Re: Data Mining and Data Visualization

Term extraction is working. Now it is time for the real fun -- building an XML graph. This may take a few days to figure out (working on it in the evenings), but that will allow time for more message data to accumulate.
 
@Guardian
KNIME's current version (2.5.4) looks quite OK -- certainly nowhere near as buggy as Gephi. I use it for small load jobs and to evaluate data from different sources.


@Megan
I made the Fucilla Graph by simply scraping the data about persons and organizations by hand from the first page of the thread and saving it as CSV, then doing the same for the edges, followed by some adjustment by hand. I am looking forward to seeing what you are doing.
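For anyone who wants to reproduce this: Gephi's spreadsheet import expects roughly this shape (made-up rows, not the real data):

Id,Label
1,Person A
2,Organisation B

and for the edge list:

Source,Target,Type
1,2,Directed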
I do SQL, PL/SQL and data design (Postgres/Oracle), and I do not yet understand NoSQL/graph-oriented databases.

@Dant
The Gephi bugs have been reported.

Hint: for all the Java tools (Gephi, yEd, KNIME, RapidMiner), check that the memory allowance for Java is sufficient for what you intend to do.
Look for the tool's conf file or startup script and set -Xmx to something sensible. From experience I set it to half the physical memory: if you have 4 GB and lots of data, set -Xmx2048m.
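For example, a plain startup script would get:

java -Xmx2048m -jar the-tool.jar

while Gephi, being a NetBeans platform app, takes the setting in its conf file (etc/gephi.conf on my install) with a -J prefix, along the lines of:

default_options="--branding gephi -J-Xms64m -J-Xmx2048m"

The exact file name and syntax vary by tool and version, so check each tool's documentation.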
 
name said:
...@Megan
I made the Fucilla Graph by simply scraping the data about persons and organizations by hand from the first page of the thread and saving it as CSV, then doing the same for the edges, followed by some adjustment by hand. I am looking forward to seeing what you are doing.
I do SQL, PL/SQL and data design (Postgres/Oracle), and I do not yet understand NoSQL/graph-oriented databases.
...

I am using MS SQL Server 2012 to build a database of nodes and connections. An SSIS package runs once an hour and queries the latest forum posts using a script component written in C#, and then adds any new or changed messages to the database.
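The fetch step boils down to something like this (a minimal sketch, not the actual script component; the host and limit are examples -- SMF exposes recent posts through its built-in action=.xml feed):

// A minimal sketch of the fetch step -- not the actual SSIS script component.
using System;
using System.Net;
using System.Xml;

class FeedFetcher
{
    static void Main()
    {
        string url = "http://example.com/forum/index.php?action=.xml;sa=recent;type=rss2;limit=100";
        using (var client = new WebClient())
        {
            var doc = new XmlDocument();
            doc.LoadXml(client.DownloadString(url));

            foreach (XmlNode item in doc.SelectNodes("//item"))
            {
                string title = item.SelectSingleNode("title").InnerText;
                string link = item.SelectSingleNode("link").InnerText;

                // The real package would upsert new or changed messages
                // into the message table at this point.
                Console.WriteLine("{0} -> {1}", title, link);
            }
        }
    }
}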

A second SSIS package runs on demand (it could be scheduled to run every day or so) and applies the Term Extraction component to the message body of each new or changed message. The results go into a "term" table containing the significant nouns and noun phrases for each message. The term table represents the graph that I am trying to build, and the message table (maintained by the first SSIS package) holds the attributes associated with each node.

After updating the term table the package runs one additional Term Extraction across the entire message table, calculating a TF-IDF score for each noun or noun phrase and storing the values in the term table. I don't know if this will be useful or not, but it was easy to do and is easy to remove.
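My reading of how the component scores a term, expressed as code (treat the formula as an assumption, not gospel):

// TFIDF(T) = (frequency of T) * log( (#rows) / (#rows containing T) )
// -- which would also explain why a term carries the same score on every
// row it appears in.
using System;

class TfIdfDemo
{
    static double TfIdf(int termFrequency, int totalRows, int rowsContainingTerm)
    {
        return termFrequency * Math.Log((double)totalRows / rowsContainingTerm);
    }

    static void Main()
    {
        Console.WriteLine(TfIdf(7, 420, 7)); // made-up counts
    }
}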

The next step is to query the graph out as XML, perhaps in GEXF, or in a format that can be transformed to GEXF or to other formats that Gephi and similar tools recognize.
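For reference, the overall GEXF shape looks simple enough (a skeleton following the GEXF 1.2 draft, with placeholder ids and labels):

<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="undirected">
    <nodes>
      <node id="0" label="first message" />
      <node id="1" label="second message" />
    </nodes>
    <edges>
      <edge id="0" source="0" target="1" />
    </edges>
  </graph>
</gexf>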

The code that extracts messages from the forum needs quite a bit of work. Right now the message bodies are being stored as HTML, and I provide an exception list to the Term Extraction component that causes it to remove HTML tag and style names. This part could be done a lot better.
 
(Disclaimer: I work for Neo)

For the types of data and queries mentioned here I think the graph data model, in general, is the ideal solution, and Neo4j would be a good fit technology-wise for the storage. I've been considering doing a forum-scrape project myself, whereby one would first manually make a list of people to be tracked (like Fucilla et al.), and the extractor would then simply note which people are mentioned in the same posts, as a way to infer relationships. This would not give high-quality data, but it would probably produce lots of it, given the number of forum posts. One option is to use this as a first step to produce suggested relationships, which are then manually confirmed.
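The extractor idea is no more than this (a purely illustrative sketch; every name in it is made up):

using System;
using System.Collections.Generic;

class CoOccurrence
{
    static void Main()
    {
        string[] tracked = { "Person A", "Person B", "Organisation C" };
        string[] posts =
        {
            "Person A founded Organisation C together with Person B.",
            "Organisation C on its own."
        };

        foreach (string post in posts)
        {
            // Which tracked names appear in this post?
            var found = new List<string>();
            foreach (string name in tracked)
                if (post.Contains(name))
                    found.Add(name);

            // Each co-occurring pair becomes a suggested relationship
            // to be confirmed manually later.
            for (int i = 0; i < found.Count; i++)
                for (int j = i + 1; j < found.Count; j++)
                    Console.WriteLine("{0} <-> {1}", found[i], found[j]);
        }
    }
}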

When it comes to visualization, I know my coworkers and the community have played with a bunch of really nifty JavaScript frameworks. I don't have any experience with them myself, but I could easily find out what works and how to do it, if necessary.

I should also say that the new query language for Neo opens some interesting possibilities. Asking things like "Is Mr. X related to Mr. Y, and through whom?" is a one-liner, literally.
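Roughly along these lines (illustrative Cypher; the label and hop limit are made up):

MATCH p = shortestPath( (x:Person {name:'X'})-[*..10]-(y:Person {name:'Y'}) ) RETURN p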

To me the big problem seems to be finding a good way to get the data in. A Semantic Wiki is one option, which could then be fed into Neo4j or other visualization tools.
 
foofighter said:
...To me the big problem seems to be finding a good way to get the data in. A Semantic Wiki is one option, which could then be fed into Neo4j or other visualization tools.

No kidding. I have been working on cleaning the body text by removing HTML tags. All of SMF's XML formats (XML, Atom, RSS) return the message body as HTML. I am trying to treat the HTML as XML and extract just the text and CDATA portions. It doesn't matter what it looks like because it is only seen by the term extraction component. Once I solve that problem I will go back to producing a standard XML graph output format.

The term table is looking good, if a bit dull. The top words are rather ordinary, and I am adding some of them to the exclusion table. I am trying to capture all forum messages, not just select ones. It should be possible to filter the result to remove unwanted topics. The more message attributes I can make available, the more filtering options there will be.
 
Megan said:
No kidding. I have been working on cleaning the body text by removing HTML tags. All of SMF's XML formats (XML, Atom, RSS) return the message body as HTML. I am trying to treat the HTML as XML and extract just the text and CDATA portions. It doesn't matter what it looks like because it is only seen by the term extraction component. Once I solve that problem I will go back to producing a standard XML graph output format.
For converting messy HTML to XML, TagSoup is your friend: http://ccil.org/~cowan/XML/tagsoup/ -- it's awesome for this kind of stuff!

The term table is looking good, if a bit dull. The top words are rather ordinary, and I am adding some of them to the exclusion table. I am trying to capture all forum messages, not just select ones. It should be possible to filter the result to remove unwanted topics. The more message attributes I can make available, the more filtering options there will be.
What about what I suggested, i.e. the other way round, where you have a manual list of words you are looking for, such as names of individuals? If using Neo, you could then tag those differently from, for example, organizations.
 
If you capture everything, you can search for what you are looking for within it. If you only capture what you are looking for, you will miss anything you didn't think to look for.

I did quite a bit of cleanup on the message bodies just using an XmlTextReader. It's very fast -- much faster than the term extraction process, and it works because SMF is emitting valid XHTML. If I start to read from other sources, TagSoup could come in handy -- thanks!
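The gist of the cleanup is just this (a minimal sketch, not the actual SSIS code; the dummy root element makes a bare fragment parseable, and this only works because SMF emits well-formed XHTML):

using System;
using System.IO;
using System.Text;
using System.Xml;

class TagStripper
{
    static string StripTags(string xhtmlFragment)
    {
        var text = new StringBuilder();
        using (var reader = new XmlTextReader(new StringReader("<root>" + xhtmlFragment + "</root>")))
        {
            // Keep only the text and CDATA portions; drop tags and attributes.
            while (reader.Read())
                if (reader.NodeType == XmlNodeType.Text || reader.NodeType == XmlNodeType.CDATA)
                    text.Append(reader.Value).Append(' ');
        }
        return text.ToString();
    }

    static void Main()
    {
        Console.WriteLine(StripTags("<p>Quoting <b>fishing</b> posts</p>")); // "Quoting fishing posts"
    }
}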

Here is a sample from the term table:

MessageID Term Score
339138 FDA 15.2995992834726
339013 feature 99.2432087578843
339101 feature 99.2432087578843
339392 feature 99.2432087578843
339083 February 15.2995992834726
339384 female body 26.4403154835855
339166 field 37.6521812635504
339174 field 37.6521812635504
339276 figure 11.5860272167683
339106 finding 18.7776052788641
339140 FIR sauna 11.5860272167683
339023 fish oil 11.5860272167683
338987 fishing 111.407162001129
339022 fishing 111.407162001129
339032 fishing 111.407162001129
339034 fishing 111.407162001129
339227 fishing 111.407162001129
339243 fishing 111.407162001129
339337 fishing 111.407162001129
339259 Flemish 11.5860272167683
339338 food 46.313497420346
339402 food 46.313497420346
339083 food chain 15.2995992834726

When a word repeats, that means it appears in more than one post. The numbers are the TF-IDF scores.
 
Megan said:
If you capture everything, you can search for what you are looking for within it. If you only capture what you are looking for, you will miss anything you didn't think to look for.
Good point! So, basically you could first create a "raw" data capture, then investigate it manually and mark the things you look for (in a particular instance), and then do further analysis based on that subset.

I did quite a bit of cleanup on the message bodies just using an XmlTextReader...

Here is a sample from the term table:
...
Nice!

Let me know if you need any suggestions on how to use Neo4j or visualizations related to it. I can easily find out from the team, if need be.
 
foofighter said:
Let me know if you need any suggestions on how to use Neo4j or visualizations related to it. I can easily find out from the team, if need be.

I was planning to make the data available for download. I am not sure what to do with data from boards that are not public, however. I could suppress that data altogether (or control access to it), or I could limit the attributes included in the graph. The message bodies will not be part of the graph, but exposing the term list could provide a clue as to what was being discussed privately. I guess the best approach for now would be to suppress the private and access-limited boards.

My collector accesses the forum directly without using a web browser, and I did not attempt to pass my logon information, so I assume that the present database contains only public posts. Hopefully that means I can ignore the issue for now (after verifying my assumption), although I will certainly want to address it later on. What I might do is collect all public boards but only specific private or limited access boards. Detecting topics and replies that are moved from a public board to a private one could be a little tricky but I think I can do it.

So while collecting the data and building graphs relate to my specialties, I would be more than happy to share in figuring out the NoSQL and graph visualization parts, where I am starting from scratch.
 
A little more progress -- I built a node list:

<node id="338991" label="Re: Podcasts et communiqués SOTT en français !" />
<node id="338992" label="Re: &amp;quot;Life Without Bread&amp;quot;" />
<node id="338993" label="Re: &amp;quot;Life Without Bread&amp;quot;" />
<node id="339000" label="Re: Getting a &amp;quot;Handl&amp;quot; on things? " />
<node id="339001" label="Re: &amp;quot;Life Without Bread&amp;quot;" />
<node id="339002" label="Re: Cryogenic Chamber Therapy" />
<node id="339003" label="Re: Leukemia" />
<node id="339005" label="Re: &amp;quot;Life Without Bread&amp;quot;" />
<node id="339006" label="Re: Laura's books at Amazon" />
<node id="339008" label="Re: &amp;quot;Life Without Bread&amp;quot;" />
<node id="339009" label="Re: Leukemia" />
<node id="339010" label="Re: Anonymous Message Attack Planned on Olympics 2012 by Government " />
<node id="339011" label="Re: &amp;quot;Life Without Bread&amp;quot;" />
<node id="339013" label="The Impossible " />
<node id="339014" label="Re: &amp;quot;Life Without Bread&amp;quot;" />
<node id="339015" label="Re: Iodine" />

I spent most of my time tonight figuring out the best way to write the data out, leaving little time to work on the actual data. The next steps are to add attributes to each node, build the edge list, wrap it all with a few more elements, and try to feed it into Gephi. The export process currently writes to a file, but it could potentially upload to my website automatically.
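The writing itself is straightforward with an XmlWriter (a trimmed sketch with a made-up file name, not the full export). A nice side effect is that the writer handles the &amp;quot; escaping in the labels for me:

using System.Xml;

class NodeListExport
{
    static void Main()
    {
        var settings = new XmlWriterSettings { Indent = true };
        using (var writer = XmlWriter.Create("nodelist.xml", settings))
        {
            writer.WriteStartElement("nodes");

            // The real export loops over the message table; one row shown here.
            writer.WriteStartElement("node");
            writer.WriteAttributeString("id", "339013");
            writer.WriteAttributeString("label", "The Impossible");
            writer.WriteEndElement();

            writer.WriteEndElement(); // nodes
        }
    }
}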
 
name said:
...I made the Fucilla Graph by simply scraping the data about persons and organizations by hand from the first page of the thread and saving it as CSV, then doing the same for the edges, followed by some adjustment by hand. I am looking forward to seeing what you are doing...

I am able to write out a GEXF file now and read it into Gephi. I need to shape the data differently, however, and I need to learn how to use Gephi.

The node lists I am creating now are too simple. They contain "parallel edges" that Gephi can't process. I can solve that by adding hierarchy: instead of having a flat list of message nodes, the graph can contain each "term" as a child of each message (forum post) in which it appears. The edges would then connect terms instead of messages. While I am at it, I should include the board & topic hierarchy above the messages.
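If I am reading the GEXF draft correctly, the hierarchy is expressed by nesting node elements, something like this (the board and topic labels are invented; message 339013 and its term "feature" are from my earlier samples):

<node id="b1" label="Board: Example Board">
  <nodes>
    <node id="t1" label="Topic: Example Topic">
      <nodes>
        <node id="339013" label="The Impossible">
          <nodes>
            <node id="339013-feature" label="feature" />
          </nodes>
        </node>
      </nodes>
    </node>
  </nodes>
</node>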

Would that be a useful graph? Now that I have a basic set of components for building graphs, I can construct them any way that would work for visualization, but what would work well?

Presently, the only edges in the graph are from a "repeats" relationship, where an edge represents a word in a message that is a repeat of that same word in some earlier message. I can add another set of edges for a "quotes" relationship, where an edge connects an earlier message with a later one that quotes from it. I also have data for each message about who posted it (name and forum profile link) and who started the topic (name and forum profile link) that contains it.
 
Most of the tech jargon and acronyms here are over my head, but I find it an interesting topic and aim to learn more about it (I recently did my first MySQL work).

Megan said:
While I am at it, I should include the board & topic hierarchy above the messages.

Would that be a useful graph? Now that I have a basic set of components for building graphs, I can construct them any way that would work for visualization, but what would work well?

Is it possible to do a layered graph, so that one could choose what level of complexity to view or combine? Like board/topic data on one layer, names in the network on another, additional related data on a third layer, and so forth? That way one could make presentations match the presentation context by adding/subtracting layers.
 
Megan said:
I am able to write out a GEXF file now and read it into Gephi.
Cool! Congratulations on your progress, and thanks for your interest in this and the great work!


Megan said:
... Would that be a useful graph? Now that I have a basic set of components for building graphs, ...
I don't know. I suppose that it depends on what one wants to understand with such a graph.


My original idea was to use graphs to help understand the relationships of the people, organizations, etc. mentioned in threads such as the Fucilla one or the longer Rense thread (for example), where people discuss a subject and contribute information about people, entities, and events over time. I am currently trying to make KNIME extract the entities (see http://en.wikipedia.org/wiki/Named_entity_recognition) from a text, and in a next step I'll see if I can make it find relations between these entities and whether there is a way to tag them in a sensible way. What I want is to eventually get a graph which more or less resembles something I'd also do by hand.


The thing with the "parallel edges" to which you allude is really the weight of the relation between those two particular nodes. Use the "weight" field of the edge and count it up each time you find another relation between the same two nodes, so that Gephi can process it -- see the GEXF draft primer at "2.3.3 Declaring an Edge" on page 7.
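So instead of two parallel edges between the same pair of nodes, you write one edge with the count as its weight, e.g.:

<edge id="0" source="339013" target="339101" weight="2.0" />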

You may also want to look at http://gexf.net if you haven't discovered it yet.
 
parallel said:
Most of the tech jargon and acronyms here are over my head, but I find it an interesting topic and aim to learn more about it (I recently did my first MySQL work).

It wants to be over my head too, but I keep batting it back down. When I started in this field, a long time ago, it was actually possible to understand what you were doing, and I understood what was going on right down to the circuit level. Now you learn things at a certain high level and don't worry about the rest unless it quits working -- there are too many levels. How far down you go depends on how fascinated you are with it. I am not fascinated with it much at all; I just want to be able to do something with it.

Part of the challenge is that "lazy" system 2. It doesn't want to deal with so much complexity, and you have to keep telling it that it isn't that bad and to hang in there. I learned long ago to divide problems into "chunks" of limited size, each of which (hopefully) is not overwhelming. The chunks might be nested one within another, or they might be parts of a system flow where one chunk does its work and hands off the results to the next chunk.

Is it possible to do a layered graph, so that one could choose what level of complexity to view or combine? Like board/topic data on one layer, names in the network on another, additional related data on a third layer, and so forth? That way one could make presentations match the presentation context by adding/subtracting layers.

I think that is what making the node list hierarchical is going to do. I don't know Gephi well enough yet to predict how changes to the input file are going to manifest in the visualizations. Once I have a good set of input data I will gain experience with Gephi.
 
name said:
...My original idea was to use graphs to help understand the relationships of the people, organizations, etc. mentioned in threads such as the Fucilla one or the longer Rense thread... I am currently trying to make KNIME extract the entities (see http://en.wikipedia.org/wiki/Named_entity_recognition) from a text, and in a next step I'll see if I can make it find relations between these entities and whether there is a way to tag them in a sensible way...

Thanks. I'll digest that when I have some free time. :)

The thing with the "parallel edges" to which you allude is really the weight of the relation between those two particular nodes. Use the "weight" field of the edge and count it up each time you find another relation between the same two nodes, so that Gephi can process it -- see the GEXF draft primer at "2.3.3 Declaring an Edge" on page 7.

You may also want to look at http://gexf.net if you haven't discovered it yet.

Gephi just flat ignored the parallel edges; it said they weren't implemented yet. The problem should go away once I implement a board/topic/message/term hierarchy in the node list.

Apparently I can also model changes in the graph over time. That sounds like loads of fun!
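If I understand the GEXF draft correctly, that means declaring the graph dynamic and giving nodes and edges start/end attributes, something like this (made-up dates):

<graph mode="dynamic" timeformat="date">
  <nodes>
    <node id="339013" label="The Impossible" start="2012-05-01" />
  </nodes>
  <edges>
    <edge id="0" source="339013" target="339101" start="2012-05-02" />
  </edges>
</graph>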
 