name said:...@Megan
I made the Fucilla Graph by simply scraping the data about persons and organizations by hand from the first page of the thread and saving it as CSV, then doing the same for the edges, followed by some adjustment by hand. I am looking forward to seeing what you are doing.
I do SQL, PL/SQL and data design (Postgres/Oracle), and I do not yet understand NoSQL/graph-oriented databases.
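Purely as an illustration of that kind of hand-built node and edge list, here is a small Python sketch that writes two CSV files in a shape Gephi's spreadsheet importer accepts (Id/Label for nodes, Source/Target for edges); the rows are placeholders, not the actual Fucilla data:

import csv

# Placeholder rows standing in for the hand-scraped persons/organizations and
# their relations; the real data came from the first page of the thread.
nodes = [("p1", "Person A"), ("p2", "Person B"), ("o1", "Organization X")]
edges = [("p1", "o1", "director of"), ("p2", "o1", "shareholder of")]

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label"])              # column names Gephi recognizes
    writer.writerows(nodes)

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Label"])
    writer.writerows(edges)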
...
foofighter said:...To me the big problem seems to be finding a good way to get the data in. A Semantic Wiki is one option, which could then be fed to Neo4j or any other visualization tool.
For converting messy HTML to XML, TagSoup is your friend: http://ccil.org/~cowan/XML/tagsoup/ It's awesome for this kind of stuff!
Megan said:No kidding. I have been working on cleaning the body text by removing HTML tags. All of SMF's XML formats (XML, Atom, RSS) return the message body as HTML. I am trying to treat the HTML as XML and extract just the text and CDATA portions. It doesn't matter what it looks like because it is only seen by the term extraction component. Once I solve that problem I will go back to producing a standard XML graph output format.
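As an aside, treating a well-formed XHTML body as XML and keeping only the character data is straightforward in most XML libraries. Megan's tooling is .NET (XmlTextReader, mentioned below); this is only a rough Python sketch of the same idea, with an assumed helper name:

import xml.etree.ElementTree as ET

def strip_tags(body_html):
    # Wrap the fragment in a dummy root so bodies with several top-level
    # elements still parse as a single XML document.
    root = ET.fromstring("<root>" + body_html + "</root>")
    # itertext() walks the tree and yields only text and tail content,
    # i.e. the character data, with all tags dropped.
    return "".join(root.itertext())

print(strip_tags("<p>Some <b>message</b> body &amp; text</p>"))
# -> "Some message body & text"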
What about what I suggested, i.e. the other way round, where you have a manual list of words you are looking for, such as names of individuals? If using Neo, you could then tag those differently from, for example, organizations.
The term table is looking good, if a bit dull. The top words are rather ordinary, and I am adding some of them to the exclusion table. I am trying to capture all forum messages, not just select ones. It should be possible to filter the result to remove unwanted topics. The more message attributes I can make available, the more filtering options there will be.
Good point! So, basically you could first create a "raw" data capture, then investigate it manually and mark the things you look for (in a particular instance), and then do further analysis based on that subset.
Megan said:If you capture everything, you can search for what you are looking for within it. If you only capture what you are looking for, you will miss anything you didn't think to look for.
Nice!
I did quite a bit of cleanup on the message bodies just using an XmlTextReader. It's very fast -- much faster than the term extraction process, and it works because SMF is emitting valid XHTML. If I start to read from other sources, TagSoup could come in handy -- thanks!
Here is a sample from the term table:
MessageID Term Score
339138 FDA 15.2995992834726
339013 feature 99.2432087578843
339101 feature 99.2432087578843
339392 feature 99.2432087578843
339083 February 15.2995992834726
339384 female body 26.4403154835855
339166 field 37.6521812635504
339174 field 37.6521812635504
339276 figure 11.5860272167683
339106 finding 18.7776052788641
339140 FIR sauna 11.5860272167683
339023 fish oil 11.5860272167683
338987 fishing 111.407162001129
339022 fishing 111.407162001129
339032 fishing 111.407162001129
339034 fishing 111.407162001129
339227 fishing 111.407162001129
339243 fishing 111.407162001129
339337 fishing 111.407162001129
339259 Flemish 11.5860272167683
339338 food 46.313497420346
339402 food 46.313497420346
339083 food chain 15.2995992834726
When a word repeats, that means it appears in more than one post. The numbers are the TF-IDF scores.
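The thread does not spell out which TF-IDF variant the term extraction component uses (in the sample above, the same term carries the same score in every message, so it may be computed over the whole corpus), but a common per-message formulation looks roughly like this Python sketch, with hypothetical names:

import math
from collections import Counter

def tf_idf_scores(messages):
    # messages: dict mapping message_id -> list of extracted terms
    n = len(messages)
    # Document frequency: in how many messages does each term appear at all?
    df = Counter()
    for terms in messages.values():
        df.update(set(terms))
    scores = {}
    for msg_id, terms in messages.items():
        tf = Counter(terms)  # raw term frequency within this message
        for term, count in tf.items():
            scores[(msg_id, term)] = count * math.log(n / df[term])
    return scores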
foofighter said:Let me know if you need any suggestions on how to use Neo4j or visualizations related to it. I can easily find out from the team, if need be.
<node id="338991" label="Re: Podcasts et communiqués SOTT en français !" />
<node id="338992" label="Re: &quot;Life Without Bread&quot;" />
<node id="338993" label="Re: &quot;Life Without Bread&quot;" />
<node id="339000" label="Re: Getting a &quot;Handl&quot; on things? " />
<node id="339001" label="Re: &quot;Life Without Bread&quot;" />
<node id="339002" label="Re: Cryogenic Chamber Therapy" />
<node id="339003" label="Re: Leukemia" />
<node id="339005" label="Re: &quot;Life Without Bread&quot;" />
<node id="339006" label="Re: Laura's books at Amazon" />
<node id="339008" label="Re: &quot;Life Without Bread&quot;" />
<node id="339009" label="Re: Leukemia" />
<node id="339010" label="Re: Anonymous Message Attack Planned on Olympics 2012 by Government " />
<node id="339011" label="Re: &quot;Life Without Bread&quot;" />
<node id="339013" label="The Impossible " />
<node id="339014" label="Re: &quot;Life Without Bread&quot;" />
<node id="339015" label="Re: Iodine" />
name said:...I made the Fucilla Graph by simply scraping the data about persons and organizations by hand from the first page of the thread and saving it as CSV, then doing the same for the edges, followed by some adjustment by hand. I am looking forward to seeing what you are doing...
Megan said:While I am at it, I should include the board & topic hierarchy above the messages.
Would that be a useful graph? Now that I have a basic set of components for building graphs, I can construct them any way that would work for visualization, but what would work well?
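Just as a sketch of what that hierarchy could look like as a graph (networkx is used here only for illustration; the actual pipeline writes GEXF directly), with a placeholder board name, and topic titles and message IDs echoing the node sample above:

import networkx as nx

g = nx.DiGraph()
# "Board A" is a placeholder; the topics and message IDs come from the node list above.
rows = [
    ("Board A", '"Life Without Bread"', 338992),
    ("Board A", '"Life Without Bread"', 338993),
    ("Board A", "Iodine", 339015),
]
for board, topic, message_id in rows:
    g.add_edge(board, topic)        # board contains topic
    g.add_edge(topic, message_id)   # topic contains message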
parallel said:Most of the tech jargon and acronyms here are over my head, but I find it an interesting topic and aim to learn more about it (I recently worked with MySQL for the first time).
Is it possible to do a layered graph, so that one could choose what level of complexity to view or combine? For example, board/topic data on one layer, names in the network on another, additional related data on a third layer, and so forth. That way one could match the presentation to its context by adding or subtracting layers.
name said:Cool! Congratulations on your progress, and thanks for your interest in this and the great work!
Megan said:I am able to write out a GEXF file now and read it into Gephi.
I don't know. I suppose that it depends on what one wants to understand with such a graph.
Megan said:... Would that be a useful graph? Now that I have a basic set of components for building graphs, ...
My original idea was to use graphs to help understand the relationships between the people, organizations, etc. mentioned in threads such as the Fucilla one, or the longer Rense thread (for example), where people discuss a subject and contribute information about people, entities, and events over time. I am currently trying to make KNIME extract the entities (see http://en.wikipedia.org/wiki/Named_entity_recognition) from a text, and as a next step I'll see whether I can make it find relations between these entities and whether there is a way to tag them in a sensible way. What I want is to eventually get a graph that more or less resembles something I would also do by hand.
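That step relies on KNIME's text processing; only to illustrate what named-entity extraction looks like in code, here is a rough NLTK sketch in Python, which is an assumption and not the tool used in the thread:

import nltk

# One-time downloads of the models needed for tokenizing, tagging and chunking.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

text = "Barack Obama met executives from Google in Washington."  # illustrative only
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))

# Collect (entity text, entity type) pairs such as PERSON or ORGANIZATION.
entities = [(" ".join(word for word, _ in subtree), subtree.label())
            for subtree in tree.subtrees() if subtree.label() != "S"]
print(entities)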
The thing with the "parallel edges" you allude to is the weight of the relation between those two particular nodes. Use the "weight" field of the edge so that Gephi can process it, and count it up each time you find another relation between the same two nodes - see the GEXF draft primer at "2.3.3 Declaring an Edge" on page 7.
You may also want to look at http://gexf.net if you haven't discovered it yet.
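To make the weight idea concrete, here is a rough Python sketch (relations made up) that collapses repeated relations between the same two nodes into single GEXF edges carrying a weight attribute:

from collections import Counter

# Made-up raw relations; each repeated (source, target) pair should become
# extra weight on a single edge rather than a parallel edge.
relations = [("personA", "orgX"), ("personA", "orgX"), ("personA", "personB")]

weights = Counter(relations)

for i, ((source, target), weight) in enumerate(weights.items()):
    print('<edge id="%d" source="%s" target="%s" weight="%.1f" />'
          % (i, source, target, weight))

# Output:
# <edge id="0" source="personA" target="orgX" weight="2.0" />
# <edge id="1" source="personA" target="personB" weight="1.0" />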