Data Mining and Data Visualization Tools

name

I posted some days ago info about a tool (Gephi) which can be used to represent data as graphs, which are easier to understand at a glance than plain text. I was worried it would overwhelm people because I have no idea of what level of computer proficiency people have or their professional backgrounds.

(Edited Stub)

Below is a short list of tools which I have used more or less extensively for analyzing data and/or representing it graphically. The main criterion for posting them here is ease of use (and that I have used them to some extent). My intention is to introduce some tools which hopefully will support what SOTT does (and not become a distraction).

A good starting point on the internet for people interested in data mining and related themes is the KDnuggets website --> http://www.kdnuggets.com
Warning: the site can be a bit overwhelming because of the amount of information it provides.

Basic pre-requisites to use these tools are:
- Know how to use a text editor
- Know how to use a spreadsheet program, like Excel or OpenOffice/LibreOffice
- Know your data, how it is stored, what kind of information you would like to get from it <<--- Most important
- Additionally, an understanding of basic mathematical and statistical concepts like sums, averages, standard deviations ... set (union, intersection) and logic (and, or, not) operations ... is very helpful for anything related to the analysis of data.
- Beyond that, even a basic knowledge of regular expressions is helpful.
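
To make the set, logic and regular-expression items above a bit more concrete, here is a minimal Python sketch. The names and the sample sentence are made up for illustration only.

```python
import re

# Hypothetical sample data: people mentioned in two different articles.
article_a = {"Alice", "Bob", "Carol"}
article_b = {"Bob", "Carol", "Dave"}

in_either = article_a | article_b   # union: mentioned in A or B
in_both = article_a & article_b     # intersection: mentioned in A and B
only_in_a = article_a - article_b   # difference: in A and not in B

# A basic regular expression that pulls four-digit years out of free text.
text = "The company was founded in 1987 and sold in 2004."
years = re.findall(r"\b(?:19|20)\d{2}\b", text)

print(sorted(in_either), sorted(in_both), sorted(only_in_a), years)
```

The same three set operations exist in spreadsheets and databases under other names, so the concept matters more than the syntax.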

Basic Tools for data analysis:
- Text editors - These are included with all (most) systems. Examples are Notepad (Windows) and vi (Linux). Use them to edit and organize your data. There are many commercial and free plain-text editors available.
- Spreadsheets - These work with grids of data or formulas. The most well-known examples are Excel (commercial) and OpenOffice/LibreOffice (open source). Spreadsheets are universal tools used to store, organize, transform and evaluate data, and can be used to do complex calculations and to create graphs of numeric data.

Specialized tools - Network Visualization
A lot of what I have seen on this forum and in many of the articles published on SOTT.net deals with people, organizations and the relations between them. The following tools can represent such "social networks" as pictures, making them easier to understand.
- Gephi - Visualization of graphs (networks) - I think that this one has been introduced already :-) Website: http://gephi.org
- yEd - This is another tool to edit and view graphs. It is simple and versatile and at the same time powerful. Its features partially overlap with Gephi. yEd is distributed as a Java .jar file, which means that it can run on any system which has Java, and that you will probably need to install Java if you have Windows (download from _www.java.com). The website for yEd is http://www.yworks.com/en/products_yed_about.html
- Cytoscape - a tool similar to Gephi but IMO more difficult to handle. It hails from the bioinformatics field but has many other applications. Not recommended for beginners. Website: http://cytoscape.org
- Article on Social Networks at Wikipedia --> http://en.wikipedia.org/wiki/Social_network
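
For anyone wondering how the relations get into such a tool in the first place, here is a minimal Python sketch that writes a list of relations as a CSV edge table. It assumes the tool's spreadsheet/CSV import recognizes "Source" and "Target" column headers (Gephi's edge-table import does); the people, companies and relation labels are invented placeholders.

```python
import csv
import io

def write_edge_list(relations, out):
    """Write (source, target, label) tuples as a CSV edge table."""
    writer = csv.writer(out)
    writer.writerow(["Source", "Target", "Label"])
    for source, target, label in relations:
        writer.writerow([source, target, label])

# Hypothetical relations between people and organizations.
relations = [
    ("Person A", "Company X", "director of"),
    ("Person B", "Company X", "investor in"),
    ("Person A", "Person B", "business partner of"),
]

# Written to an in-memory buffer here; pass an open file to save to disk.
buffer = io.StringIO()
write_edge_list(relations, buffer)
print(buffer.getvalue())
```

Keeping the relations in a plain table like this also means they can be maintained in a spreadsheet by people who never open the graph tool.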

Advanced universal tools
These tools can be used for a vast array of tasks related to data mining, data analysis, conversion, storage, etc. Like spreadsheets, they have a very low barrier to entry, but they also have features which allow them to be used in very complex scenarios. If you understand spreadsheets and Lego, then you will likely know how to use these tools. One example use would be to download an Excel file from the internet, do some needed transformations and load it into a database. These tools allow you to perform operations on data (calculations, replacing strings, selecting values, etc.), but instead of showing you the data as a grid, they let you drop "steps" (icons representing some sort of calculation) on a canvas and connect them.
- Knime - The Konstanz Information Miner. Just like spreadsheets, it offers many (several hundred) operators on data. It can read various types of data, and you can even construct web spiders for various purposes with it. It should be intuitive enough for people to be able to put together simple analysis jobs on the fly. Be warned, though, that it has many features and components which are not intuitive.
The website for Knime is http://knime.org
- RapidMiner - This tool is very similar to Knime, but IMO less intuitive (although it makes prettier graphics). If you think in "variables/samples" instead of "columns/rows" then you might find it interesting. Website for this product is http://rapid-i.com/
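
The "steps on a canvas" idea can also be pictured in plain code as a chain of small functions, each doing one thing and handing its result to the next. The Python sketch below is only an analogy under that assumption, not how Knime or RapidMiner work internally; the inline CSV text stands in for a downloaded spreadsheet, and the table and column names are made up.

```python
import csv
import io
import sqlite3

# Stand-in for a downloaded spreadsheet exported as CSV (made-up values).
RAW = "name,amount\n alice ,10\nBOB,25\ncarol,\n"

def read_rows(text):
    """Step 1: read the source into dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def clean_rows(rows):
    """Step 2: normalize names and drop rows without an amount."""
    for row in rows:
        if row["amount"].strip():
            yield row["name"].strip().title(), int(row["amount"])

def load_rows(rows, conn):
    """Step 3: write the cleaned rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()

# Connect the steps: read -> clean -> load.
conn = sqlite3.connect(":memory:")
load_rows(clean_rows(read_rows(RAW)), conn)
total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(total)
```

Each function corresponds to one icon on the canvas, and the nested call on the second-to-last lines corresponds to the connecting arrows.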

I will leave this here for now, in the hope that the info is helpful.

(Edited and expanded lots)
 
Re: Data Mining and Data Visualization

I think this is a great idea. While reading certain books like Controversy of Zion and Family of Secrets (on the Bush family), I thought it would be great to see all the personal connections mentioned in diagram form. Now, maybe we can do it? Also, the Dutroux perps...
 
Re: Data Mining and Data Visualization


@ name: Please note that Gephi 0.8.1 is Beta.
I have joined the forum to complain about some
things that aren't working very well. This site is
moderated, so I cannot see whether my posting will be
accepted. Ugh.

For me, these are the problems I found so far:

1) Preview simply is empty.
2) Snapshot hangs (spins its wheels)
3) Export: pdf,png,svg File... seems to be unstable.
a) If one uses PNG, selects Options, and sets "Transparent",
it remains in effect unless one restarts gephi. Unchecking
has no effect.
b) The saved file image node text is clipped at the borders just
like your posted image in the Fucilla thread.
c) At some random point I could no longer save
the image with full details: the text completely disappeared
for all objects (nodes, edges, etc.) and, instead of
straight connecting lines, they were all curved connecting lines!

So basically, I am having trouble saving an image file.

Those who wish to use gephi as a project ought to join the gephi
forum to keep track of updates, report bugs, and so on, osit.
(Forum @ _http://forum.gephi.org)
 
Re: Data Mining and Data Visualization

@Dant
Yes, I forgot to mention the Beta status :-( probably because I have become so accustomed to the bugs that I work around them.
I know some other tools, but gephi is IMO the simplest one to use, that's why I introduced it here.
Re your problems:
1) Click Refresh in the Preview page, then switch to Overview and back. Repeat if it does not work the first time. Zoom in and out.
2) Have never seen this one.
3.a) I have no problems exporting transparent images.
3.b) Oh how awkward of it to clip the picture:-) To work around this, place some (unconnected) nodes a bit outside of your graph before exporting.
3.c) It looks like it forgets stuff. Try the following to correct this situation:
Set your preferences in the pane on the left,
- click "Show Labels" under Node Labels.
- click "Show Labels" under Edge Labels.
- under "Edges", de-select "Curved"
- Check the fonts and choose ones which are installed on your system, lest it fumble those in the output too
Save your preset by using the small button right over the "Presets" dropdown.


It looks like a good idea to bring these things to the Gephi forum.

Here is the Fucilla graph re-exported to avoid the clipping:
 
Re: Data Mining and Data Visualization

@name

What I mean in 3a is that once you set to transparent,
you cannot unset it. You have to restart gephi.

Thanks for the tutorial details; these
instructions should help others as well.
I can now create the image files and keep the text from
being clipped!

I wonder if the other gephi details in the threads posted in
fucilla ought to be merged into this post so that this post
flows better?

Again thanks!
 
Re: Data Mining and Data Visualization

With exception of the graph which IMO belongs there, I concur that it would be polite to merge the posts about this tool here, but I have no idea how to merge them into this thread. Can you do that?
And yes, I discovered some more things ...

I will post these problems over at the Gephi forum tomorrow. Must go to bed now. Good night.
 
Re: Data Mining and Data Visualization

No, I am not a moderator, only mods can do this.
 
Re: Data Mining and Data Visualization

Instead of merging the threads, would it be better to move this thread to the Creative Acts board and leave it as a stand-alone topic? :huh:
 
Re: Data Mining and Data Visualization

Name, with regard to your reference to knime in the other thread, yeah I'm familiar with it, and I'm not really a fan. Of course it could just be because I'm old, but I think visual languages are soooooo limited compared to text languages. Knime was also VERY buggy when I tried it, but that was a while ago.

I suppose it could be useful for science stuff, but not so much for the kind of data we're working with at the moment.
 
Re: Data Mining and Data Visualization

name said:
I posted some days ago info about a tool (Gephi) which can be used to represent data as graphs, which are easier to understand at a glance than plain text. I was worried it would overwhelm people because I have no idea of what level of computer proficiency people have or their professional backgrounds.
I have been thinking for a while that a lot of what SOTT does is collecting and evaluating data, so perhaps some sort of infrastructure and some information about tools to do such things could be useful.
I am starting this thread (for now as a stub) so that over time people can post information about tools which can be used to procure, organize and visualize data.

"Business Intelligence" (BI) is a major part of what I do for a living. I am not familiar with this particular tool (my experience is with Microsoft databases and BI tools) but if there is interest in the topic then I will contribute what I can.

I have been thinking not so much about visualization as about how to capture and relate the kind of data that has been accumulating here. That is a prerequisite to visualizing it with software tools. Unfortunately, I haven't had much time to think about it, but working as a group I think we could go somewhere with it.

Edit: Removed a "devil" that appeared out of nowhere in the middle of a sentence. I must have overlooked a detail. :)
 
Re: Data Mining and Data Visualization

@Vulcan59:

I am not sure how best to go about setting up
projects using the forum. Perhaps some suggestions
could come from the SOTT developers/IT, since they
understand how to set up and manage projects. Using
a forum will present some challenges, though, or maybe
there is a better way?

Consider this structure - how can this work in a forum?
Maybe this is overcomplicating and could be simpler?

               Sott Projects
             +===============+ ...
             |               |
Data Mining & Visualization  Project #2 ...
             |
   +=======+======+======+
   |       |      |      |
 CoZion Dutroux  FoS  Fucilla
                  |
                  +=======+======== ...
                  |       ...
                 Bush

The only reason I suggested the merge is because we may
not want this to be public, may not want to clutter the original
post, and maybe for other reasons.
 
Re: Data Mining and Data Visualization

Something to keep in mind when setting expectations (I don't mean to discourage): this tool has a LOT of potential, judging by the questions on its forum. People are even using it for brain mapping, etc.

_http://forum.gephi.org/viewtopic.php?t=1816
Re: Which upgrade for best gephi performance?

Post by jonswords » 23 Apr 2012 21:01
The graphs i'm currently working with are c.15000 nodes and 10-12000 edges. This will likely increase in size in the near future.


Post by seinecle » 24 Apr 2012 09:38
I would go for 8Gb RAM minimum and as good a graphic card as you can.

I've read somewhere that the roadmap for Gephi includes the development of GPU (graphic cards)-based computations, because this provides a huge boost in processing power (not just for the graphic rendering, but also for many sorts of computation). Even if these developments are surely not for the next 6 months, you could future-proof your laptop by having a very good GPU on it.

Best,

Clement

I wondered why it was slow on my laptop when I imported an example graph with 5000 nodes.
 
Re: Data Mining and Data Visualization

I am finally starting to make sense out of this thread, and I am looking at Gephi and Neo4j. I don't yet understand how the network was built that produced the graph above, but I think it will be clearer once I have these things running. I had already been thinking about ways to build graphs from forum messages, and data collected that way might feed into these tools.
 
Re: Data Mining and Data Visualization

Neo4j is becoming my introduction to NoSQL databases. As an SQL/MDX developer I haven't had any reason to look at them, but it makes sense for this kind of data. Reading about Neo4j doesn't, however, answer my questions about software for building graphs from forum messages or other textual sources.

I am thinking about collecting data from the forum using either RSS or the SMF ".xml" feature. This would also make it possible to collect from other sources such as blogs using more or less the same code.

What I am trying to figure out now is what to collect. Possibilities I see so far include
  • Board ID
  • Topic ID
  • Message ID
  • Poster ID
  • Subject
  • Links to other posts
  • Links to other websites
  • Body text
  • Nouns & noun phrases
Most of these fields are specific to the forum (and maybe others like it), but the last three are more general. One of the things I would like to do is to analyze text for nouns and noun phrases, and build a relational database of those that are shared. I am not really sure how to go about it; all I can do is try it and refine as I go.
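
As a rough sketch of the collection step, the Python below parses a generic RSS 2.0 feed into records with some of the fields listed above and pulls the links out of the body text. The feed content is a made-up sample, the element names are those of plain RSS 2.0 (the SMF ".xml" output may well differ), and `parse_feed` is a hypothetical helper name. Noun and noun-phrase extraction is left out because it needs a proper term extraction tool.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample feed; a real one would be downloaded from the forum.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example board</title>
  <item>
    <title>Re: Data Mining</title>
    <link>http://forum.example/index.php?topic=1.msg10#msg10</link>
    <pubDate>Mon, 23 Apr 2012 21:01:00 GMT</pubDate>
    <description>See http://gephi.org and http://knime.org for details.</description>
  </item>
</channel></rss>"""

# Simple pattern for links inside body text; deliberately not exhaustive.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

def parse_feed(xml_text):
    """Turn each <item> of an RSS 2.0 feed into a dictionary of fields."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        body = item.findtext("description", default="")
        records.append({
            "subject": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "body": body,
            "links": URL_PATTERN.findall(body),
        })
    return records

records = parse_feed(SAMPLE_RSS)
print(records[0]["subject"], records[0]["links"])
```

Because the parser only relies on standard RSS elements, more or less the same code should work for blogs that publish a feed.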

Some of these items represent nodes or node attributes in a network (IDs, subject, body) while others represent relationships (links, nouns). Noun and noun phrase relationships could be very useful for connecting things, but they don't seem to have an obvious "direction" and I am not sure how this might affect their use in Gephi. Or maybe they do at least sometimes have an implied direction deriving from the way the text in which they appear may be linked to other posts via quoting, in the case of forum posts. I could give them a direction based on time stamps, I suppose.
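
The time-stamp idea could look something like the following Python sketch: every pair of posts that shares at least one term becomes an edge, pointing from the earlier post to the later one. The post IDs, time stamps and terms are invented, and `shared_term_edges` is a hypothetical name; this is one possible convention, not a statement about what Gephi requires.

```python
from itertools import combinations

# Hypothetical posts: (message_id, ISO time stamp, set of extracted terms).
posts = [
    (101, "2012-04-20T10:00", {"gephi", "graph", "export"}),
    (102, "2012-04-21T09:30", {"gephi", "knime"}),
    (103, "2012-04-22T18:45", {"neo4j", "graph"}),
]

def shared_term_edges(posts):
    """Return (source_id, target_id, shared_terms), earlier post as source."""
    edges = []
    # Sorting by time stamp first means each pair comes out as (earlier, later).
    for earlier, later in combinations(sorted(posts, key=lambda p: p[1]), 2):
        shared = earlier[2] & later[2]
        if shared:
            edges.append((earlier[0], later[0], sorted(shared)))
    return edges

for edge in shared_term_edges(posts):
    print(edge)
```

Comparing every pair of posts grows quadratically with the number of posts, so a real data set would probably need an index from term to posts instead.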
 
Re: Data Mining and Data Visualization

I managed to build something that downloads forum messages into a relational database automatically, a first step toward doing other things. The term extraction tool I am using is less flexible than I thought, and I am going to have to figure out a better way of doing that, or else come up with a different one. In the meantime at least I am collecting data.
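
For readers who want to picture what such a database might look like, here is a minimal Python/SQLite sketch. The table and column names are assumptions made for illustration (they are not the actual schema used), and the rows are placeholders. It stores messages and their extracted terms in two tables and then lists the terms that appear in more than one message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table for messages, one for extracted terms.
conn.executescript("""
CREATE TABLE messages (
    message_id INTEGER PRIMARY KEY,
    topic_id   INTEGER,
    poster_id  INTEGER,
    subject    TEXT,
    body       TEXT
);
CREATE TABLE message_terms (
    message_id INTEGER REFERENCES messages(message_id),
    term       TEXT,
    PRIMARY KEY (message_id, term)
);
""")

# Placeholder rows standing in for downloaded forum messages.
conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?, ?)", [
    (1, 10, 7, "Gephi export", "placeholder body text"),
    (2, 10, 8, "Re: Gephi export", "placeholder body text"),
    (3, 11, 7, "Neo4j", "placeholder body text"),
])
conn.executemany("INSERT INTO message_terms VALUES (?, ?)", [
    (1, "gephi"), (1, "export"), (2, "gephi"), (3, "neo4j"),
])

# Terms shared by more than one message are candidates for relationships.
shared = conn.execute("""
    SELECT term, COUNT(*) AS n
    FROM message_terms
    GROUP BY term
    HAVING COUNT(*) > 1
    ORDER BY term
""").fetchall()
print(shared)
```

Keeping terms in their own table means the term extraction tool can be swapped out later without touching the downloaded messages.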
 