Data Mining and Data Visualization Tools
foofighter:
--- Quote from: Megan on May 02, 2012, 05:37:33 AM ---No kidding. I have been working on cleaning the body text by removing HTML tags. All of SMF's XML formats (XML, Atom, RSS) return the message body as HTML. I am trying to treat the HTML as XML and extract just the text and CDATA portions. It doesn't matter what it looks like because it is only seen by the term extraction component. Once I solve that problem I will go back to producing a standard XML graph output format.
--- End quote ---
For converting messy HTML to XML, TagSoup is your friend: http://ccil.org/~cowan/XML/tagsoup/ It's awesome for this kind of stuff!
--- Quote ---The term table is looking good, if a bit dull. The top words are rather ordinary, and I am adding some of them to the exclusion table. I am trying to capture all forum messages, not just select ones. It should be possible to filter the result to remove unwanted topics. The more message attributes I can make available, the more filtering options there will be.
--- End quote ---
What about what I suggested, i.e. the other way round, where you keep a manual list of words you are looking for, such as names of individuals? If using Neo4j, you could then tag those differently from, for example, organizations.
Megan:
If you capture everything, you can search for what you are looking for within it. If you only capture what you are looking for, you will miss anything you didn't think to look for.
I did quite a bit of cleanup on the message bodies just using an XmlTextReader. It's very fast -- much faster than the term extraction process, and it works because SMF is emitting valid XHTML. If I start to read from other sources, TagSoup could come in handy -- thanks!
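A minimal sketch of that cleanup step, written in Python rather than with the .NET XmlTextReader mentioned above. It assumes the message body is well-formed XHTML with no undeclared named entities such as `&nbsp;`; the function name `body_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

def body_text(xhtml_fragment: str) -> str:
    """Return only the text and CDATA content of a well-formed XHTML fragment."""
    # Wrap the fragment in one root element so that a body with several
    # top-level elements still parses as a single XML document.
    root = ET.fromstring(f"<body>{xhtml_fragment}</body>")
    # itertext() yields text nodes (including CDATA content) in document
    # order and skips the tags themselves.
    pieces = root.itertext()
    # Join with spaces and collapse whitespace; layout does not matter
    # because only the term extraction step reads the result.
    return " ".join(" ".join(pieces).split())

print(body_text('Hello <b>world</b><br /><![CDATA[a < b]]>'))
```

Joining the pieces with a space keeps words in adjacent block elements apart, at the cost of splitting a word that has a tag in the middle of it.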
Here is a sample from the term table:
--- Quote ---
MessageID  Term         Score
339138     FDA          15.2995992834726
339013     feature      99.2432087578843
339101     feature      99.2432087578843
339392     feature      99.2432087578843
339083     February     15.2995992834726
339384     female body  26.4403154835855
339166     field        37.6521812635504
339174     field        37.6521812635504
339276     figure       11.5860272167683
339106     finding      18.7776052788641
339140     FIR sauna    11.5860272167683
339023     fish oil     11.5860272167683
338987     fishing      111.407162001129
339022     fishing      111.407162001129
339032     fishing      111.407162001129
339034     fishing      111.407162001129
339227     fishing      111.407162001129
339243     fishing      111.407162001129
339337     fishing      111.407162001129
339259     Flemish      11.5860272167683
339338     food         46.313497420346
339402     food         46.313497420346
339083     food chain   15.2995992834726
--- End quote ---
When a word repeats, that means it appears in more than one post. The numbers are the TF-IDF scores.
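For readers unfamiliar with TF-IDF, here is a rough sketch of the textbook version: term frequency within a post multiplied by the log of the inverse document frequency. This is an assumption for illustration only. The sample above shows the same score for a term in every post it appears in, so the term extraction component used here probably weights things differently. The function name `tf_idf` and the input layout are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs: dict[int, list[str]]) -> dict[tuple[int, str], float]:
    """Score each (message_id, term) pair with textbook TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many messages does each term occur?
    df = Counter(term for terms in docs.values() for term in set(terms))
    scores = {}
    for msg_id, terms in docs.items():
        counts = Counter(terms)
        for term, count in counts.items():
            tf = count / len(terms)             # share of this message's terms
            idf = math.log(n_docs / df[term])   # rarer across messages = higher
            scores[(msg_id, term)] = tf * idf
    return scores

# Toy input with made-up message ids; terms are already extracted.
sample = {1: ["fishing", "food"], 2: ["fishing"], 3: ["field"]}
for key, score in sorted(tf_idf(sample).items()):
    print(key, round(score, 4))
```

With this weighting a term that occurs in every message scores zero, which has a similar effect to putting ordinary words in an exclusion table.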
foofighter:
--- Quote from: Megan on May 02, 2012, 07:42:33 AM ---If you capture everything, you can search for what you are looking for within it. If you only capture what you are looking for, you will miss anything you didn't think to look for.
--- End quote ---
Good point! So, basically you could first create a "raw" data capture, then investigate it manually and mark the things you look for (in a particular instance), and then do further analysis based on that subset.
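Roughly, the "capture everything, then mark a subset" idea could look like the sketch below. The watch list, its category tags and the names `TermRow` and `tag_rows` are all hypothetical; the rows are taken from the sample term table.

```python
from typing import NamedTuple

class TermRow(NamedTuple):
    message_id: int
    term: str
    score: float

# Hypothetical manual watch list: term of interest -> category tag.
WATCH_LIST = {"FDA": "organization", "fish oil": "topic"}

def tag_rows(rows, watch_list):
    """Return (row, category) pairs for rows whose term is on the watch list."""
    # Compare case-insensitively so "fda" and "FDA" match the same entry.
    lowered = {term.lower(): category for term, category in watch_list.items()}
    return [
        (row, lowered[row.term.lower()])
        for row in rows
        if row.term.lower() in lowered
    ]

# The raw capture stays complete; the watch list only selects a view of it.
raw_capture = [
    TermRow(339138, "FDA", 15.2995992834726),
    TermRow(339023, "fish oil", 11.5860272167683),
    TermRow(338987, "fishing", 111.407162001129),
]
for row, category in tag_rows(raw_capture, WATCH_LIST):
    print(row.message_id, row.term, category)
```

Because the raw table is never trimmed, a different watch list can be applied later without collecting the data again.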
--- Quote ---I did quite a bit of cleanup on the message bodies just using an XmlTextReader. It's very fast -- much faster than the term extraction process, and it works because SMF is emitting valid XHTML. If I start to read from other sources, TagSoup could come in handy -- thanks!
--- End quote ---
Nice!
Let me know if you need any suggestions on how to use Neo4j or visualizations related to it. I can easily find out from the team, if need be.
Megan:
--- Quote from: foofighter on May 02, 2012, 08:50:58 AM ---Let me know if you need any suggestions on how to use Neo4j or visualizations related to it. I can easily find out from the team, if need be.
--- End quote ---
I was planning to make the data available for download. I am not sure what to do with data from boards that are not public, however. I could suppress that data altogether (or control access to it), or I could limit the attributes included in the graph. The message bodies will not be part of the graph, but exposing the term list could provide a clue as to what was being discussed privately. I guess the best approach for now would be to suppress the private and access-limited boards.
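A small sketch of the suppression step, assuming each message can be mapped to the board it lives on. The board names, the `board_by_message` lookup and the function name are hypothetical; nothing here reflects how the collector actually stores board information.

```python
def suppress_private(term_rows, board_by_message, public_boards):
    """Keep only rows whose message is known to be on a public board."""
    # A message with no known board maps to None, which is never in
    # public_boards, so unknown cases are dropped rather than exposed.
    return [
        row for row in term_rows
        if board_by_message.get(row[0]) in public_boards
    ]

rows = [
    (339138, "FDA", 15.2995992834726),
    (339023, "fish oil", 11.5860272167683),
    (339013, "feature", 99.2432087578843),
]
# Hypothetical board assignments; 339013 is deliberately missing.
board_by_message = {339138: "Health", 339023: "Private Board"}
print(suppress_private(rows, board_by_message, {"Health"}))
```

Filtering the term rows as well as the nodes matters because, as noted above, the term list alone could hint at what was discussed privately.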
My collector accesses the forum directly without using a web browser, and I did not attempt to pass my logon information, so I assume that the present database contains only public posts. Hopefully that means I can ignore the issue for now (after verifying my assumption), although I will certainly want to address it later on. What I might do is collect all public boards but only specific private or limited access boards. Detecting topics and replies that are moved from a public board to a private one could be a little tricky but I think I can do it.
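One possible way to spot posts that have left public view, sketched under the assumption that the collector can list the message ids an anonymous visitor sees on each run. The function names are hypothetical. A deleted post would look the same as a moved one with this approach, so it only says "no longer public", not why.

```python
def no_longer_public(stored_ids, visible_ids):
    """Ids collected earlier that an anonymous crawl can no longer see."""
    return set(stored_ids) - set(visible_ids)

def purge(term_rows, hidden_ids):
    """Drop term rows that belong to messages which are no longer public."""
    return [row for row in term_rows if row[0] not in hidden_ids]

stored = {339138, 339023, 339013}   # ids already in the database
visible = {339138, 339013}          # ids seen on the latest anonymous run
hidden = no_longer_public(stored, visible)
print(sorted(hidden))
```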
So while collecting the data and building graphs relate to my specialties, I would be more than happy to share the work of figuring out the NoSQL and graph visualization parts, where I am starting from scratch.
Megan:
A little more progress -- I built a node list:
--- Quote ---<node id="338991" label="Re: Podcasts et communiqués SOTT en français !" />
<node id="338992" label="Re: &quot;Life Without Bread&quot;" />
<node id="338993" label="Re: &quot;Life Without Bread&quot;" />
<node id="339000" label="Re: Getting a &quot;Handl&quot; on things? " />
<node id="339001" label="Re: &quot;Life Without Bread&quot;" />
<node id="339002" label="Re: Cryogenic Chamber Therapy" />
<node id="339003" label="Re: Leukemia" />
<node id="339005" label="Re: &quot;Life Without Bread&quot;" />
<node id="339006" label="Re: Laura's books at Amazon" />
<node id="339008" label="Re: &quot;Life Without Bread&quot;" />
<node id="339009" label="Re: Leukemia" />
<node id="339010" label="Re: Anonymous Message Attack Planned on Olympics 2012 by Government " />
<node id="339011" label="Re: &quot;Life Without Bread&quot;" />
<node id="339013" label="The Impossible " />
<node id="339014" label="Re: &quot;Life Without Bread&quot;" />
<node id="339015" label="Re: Iodine" />
--- End quote ---
I spent most of my time tonight figuring out the best way to write the data out, leaving little time to work on the actual data. The next steps are to add attributes to each node, build the edge list, wrap it all with a few more elements, and try to feed it into Gephi. The export process currently writes to a file, but it could potentially upload to my website automatically.
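A sketch of what the wrapping step might look like, assuming the target is GEXF 1.2, which the node syntax above resembles and which Gephi can open. The post does not name the format or say what the edges will represent, so the reply-to edge below and the function name `build_gexf` are purely illustrative.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"

def build_gexf(nodes, edges):
    """Wrap (id, label) nodes and (source, target) edges in a GEXF document."""
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        gexf, "graph", {"mode": "static", "defaultedgetype": "directed"}
    )
    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in nodes:
        # ElementTree escapes quotes in attribute values, giving &quot;.
        ET.SubElement(nodes_el, "node", {"id": str(node_id), "label": label})
    edges_el = ET.SubElement(graph, "edges")
    for index, (source, target) in enumerate(edges):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(index), "source": str(source), "target": str(target)},
        )
    return ET.tostring(gexf, encoding="unicode")

nodes = [
    (338992, 'Re: "Life Without Bread"'),
    (338993, 'Re: "Life Without Bread"'),
]
edges = [(338993, 338992)]  # hypothetical: one post linked to an earlier one
print(build_gexf(nodes, edges))
```

Writing the returned string to a file with a .gexf extension should be enough to try opening it in Gephi.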