Megan said:
I've been busy with other things for the past month, but I am trying to take some time this holiday weekend to complete the part of the data collector that will download an entire topic. I would like to see if I can build useful graphs of topics of interest using this tool.
Good job, Megan, on the initiative and progress.
Initially I was excited about this project, but I soon realized I lacked the skill set to go further, and I couldn't find the time either.
Here are some thoughts while trying to visualize how this can work.
The main thing in any data mining project is structured data, plus the ability to accumulate data in a consistent pattern, link it, and present it. Judging by the thread, there is plenty of skill here to do it, and I tried to fit it into the general data warehouse (DW) model (while violating some basic DW principles for simplicity).
Though I drew separate boxes in the traditional DW sense, a single program can accomplish this end to end (as Megan already did) or
Let's frame the purpose as the ability to connect different entities (people, events, organizations, products) based on the individual entities entered, then link and present them.
For this we need
• Choose a consistent architecture (DB, tools) for reusability. The DB could be a simple MySQL (or Neo4j) instance hosted in a central location, plus extraction tools and presentation tools like Gephi, yEd, etc.
• Create a structured, reusable data structure: the data model.
• Consistent ID generation, consistent across whatever means (technology) is used. Having consistent tools helps with consistent ID generation.
• Ability to clean the data and merge IDs, either programmatically or through a custom or off-the-shelf screen. Here I see a need for a web screen, based on the data model, that can display the DB content and let us update the IDs etc. Alternatively this can be done programmatically, if possible, or we can use the MySQL screens themselves or DML statements to update the data.
There must be some ready-made web screens available on the net, but I couldn't find any in a quick search.
• Ability to build relationships between people, orgs, events, and products, either manually through a custom screen or using queries. Here I see a need for a web screen that displays the DB table data and lists the entities, so we can select different IDs to relate. Alternatively, build the relationships programmatically, or build them manually in Excel and upload.
• The presentation part can be achieved by many means, such as creating DB views, XML, or any other format that tools like Gephi and yEd accept.
• A web screen that offers linking the way LinkedIn does it (if we give a company name, it returns the people in that company who are linked to our profile through intermediaries). Not an easy task, but worth it; this type of feature would be great. But data quality becomes very important, or else it becomes junk in, junk out. If we can organize the data hierarchically, simple SQL will do the trick.
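To make the LinkedIn-style linking idea a bit more concrete, here is a rough sketch of finding the chain of intermediaries between two entities with a breadth-first search. This is only an illustration under my own assumptions: the ID format ("person:alice", "org:acme"), the function name find_connection, and the sample links are all made up, and in practice the pairs would come from whatever relationship table we end up with.

```python
from collections import deque


def find_connection(links, start, target):
    """Return the shortest chain of entity IDs from start to target, or None."""
    # Build an undirected adjacency map from (id_a, id_b) relationship pairs.
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Breadth-first search: the first path that reaches the target is a shortest one.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of intermediaries exists


# Hypothetical sample data: each pair is one known relationship.
links = [
    ("person:me", "person:alice"),
    ("person:alice", "person:bob"),
    ("person:bob", "org:acme"),
    ("person:carol", "org:acme"),
]
path = find_connection(links, "person:me", "org:acme")
print(path)
```

This is where the junk in, junk out concern bites: if the same person sits in the data under two different IDs, the search will miss chains that really exist.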
Though I feel web screens are useful for non-technical people to clean the data, some of this can be done in MySQL itself.
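As a rough idea of what cleaning by plain DML could look like, here is a small sketch that merges a duplicate ID into the one we want to keep. I am using Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere; the table names (entity, relationship), the columns, and the sample rows are assumptions of mine, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE relationship (from_id TEXT, to_id TEXT, kind TEXT);
INSERT INTO entity VALUES ('p1', 'John Smith'), ('p2', 'J. Smith'), ('o1', 'Acme');
INSERT INTO relationship VALUES ('p1', 'o1', 'works_for'), ('p2', 'o1', 'advises');
""")


def merge_ids(conn, keep_id, duplicate_id):
    """Repoint every relationship from duplicate_id to keep_id, then drop the duplicate."""
    with conn:  # one transaction: either all three statements apply or none
        conn.execute("UPDATE relationship SET from_id = ? WHERE from_id = ?",
                     (keep_id, duplicate_id))
        conn.execute("UPDATE relationship SET to_id = ? WHERE to_id = ?",
                     (keep_id, duplicate_id))
        conn.execute("DELETE FROM entity WHERE id = ?", (duplicate_id,))


# Assume we decided by hand that 'p2' is the same person as 'p1'.
merge_ids(conn, "p1", "p2")
rows = conn.execute(
    "SELECT from_id, to_id, kind FROM relationship ORDER BY kind").fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM entity").fetchone()[0]
print(rows, remaining)
```

The same three statements could be typed directly into a MySQL client; a web screen would mostly be a friendlier way to pick keep_id and duplicate_id.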
The structured data is represented by the data model. I took a shot at the model; it is a very rough one, and I violated some standard rules of normalization to avoid additional work. (Ignore the relationship cardinalities, 1:N etc.)
Note: we can add a lot of additional information, like the source (node ID, board ID, etc.) and graph formatting info (node width, length), etc.
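For anyone who prefers to see it as tables, here is a minimal sketch of what such a model and the consistent ID generation could look like. It is not the model from my diagram, just an illustration: the table and column names, the helper names make_entity_id and add_entity, and the choice of a hash over the normalized name are all my assumptions, and sqlite3 again stands in for MySQL.

```python
import hashlib
import sqlite3


def make_entity_id(entity_type, name):
    """Derive the same ID for the same entity regardless of case or spacing."""
    normalized = " ".join(name.lower().split())
    digest = hashlib.sha1(f"{entity_type}|{normalized}".encode("utf-8")).hexdigest()
    return f"{entity_type}:{digest[:12]}"


# Deliberately denormalized: source and graph formatting info sit on the entity row.
SCHEMA = """
CREATE TABLE entity (
    id              TEXT PRIMARY KEY,
    entity_type     TEXT NOT NULL
        CHECK (entity_type IN ('person', 'event', 'organization', 'product')),
    name            TEXT NOT NULL,
    source_node_id  TEXT,   -- where the data was collected from
    source_board_id TEXT,
    node_width      REAL,   -- graph formatting hints for the presentation tool
    node_length     REAL
);
CREATE TABLE relationship (
    from_id TEXT NOT NULL REFERENCES entity(id),
    to_id   TEXT NOT NULL REFERENCES entity(id),
    kind    TEXT NOT NULL,
    PRIMARY KEY (from_id, to_id, kind)
);
"""


def add_entity(conn, entity_type, name, source_node_id=None, source_board_id=None):
    """Insert the entity if it is new and return its ID either way."""
    entity_id = make_entity_id(entity_type, name)
    conn.execute(
        "INSERT OR IGNORE INTO entity "
        "(id, entity_type, name, source_node_id, source_board_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (entity_id, entity_type, name, source_node_id, source_board_id),
    )
    return entity_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = add_entity(conn, "person", "John  Smith")
second = add_entity(conn, "person", "john smith")  # same person, different typing
entity_count = conn.execute("SELECT COUNT(*) FROM entity").fetchone()[0]
print(first == second, entity_count)
```

Hashing the name only catches differences in case and spacing; variants like "J. Smith" would still need the manual or programmatic merge step.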
I can help in the data modeling aspect and other activities.