TransCript (Session Searching Tool)

darksai

Jedi Master
Introduction

I thought it would be nice for the forum to have a more structured, dynamic and efficient way of searching through the C's session transcripts, so I sat down one weekend afternoon and put down some ideas that I thought would be useful for an application that serves that purpose. In a nutshell, the way I envisage it, sessions would be broken down into "Question and Answer" units, or QAs for short, which would be displayed as search results (similar to the lawofone site, except.. well, better :cool2: ). You would be able to search these by keywords, groups of keywords (metawords), date, etc., all of which would be tabled in a database to make it a lot smoother. Another feature I thought of would be to store start and end points for sequential QAs containing the same keywords, forming QA "blocks" (QABs), which would allow one to search for complete conversations within a session on a keyword or metaword.

I should add that this is not a typical "text-based" kind of search program, since we already have that on the forum (i.e. using advanced search, checking the transcript board only, and putting "Laura" in the "By User" field). This is more for quick access to all mentions of a particular word, topic, or group thereof, while excluding as much of the irrelevant text as possible by using predefined terms. So you'd be selecting words from a list and have various options available, rather than typing what you want in a text field.

Before I go further, allow me to explicate some definitions:

  • Session: A full session transcript, identified by its date.
  • QA: A typical question and answer pair, considered as a whole, uniquely identified by a serial number across all QAs, from oldest to most recent.
  • QAB: A sequence of consecutive QAs that contain the same search-word.
  • Result: Sessions, QAs and QABs.
  • Search-word: Any word or term that is used to return results.
  • Keyword: A search-word that is in the actual text or manually added to be associated with a QA (like a "tag", though that sounds boring, even if it is logically the same thing).
  • Metaword: A search-word that combines multiple keywords. It is not really a "word" per se but actually an abstract grouping (I'm hesitant to use a word like "concept"). It is possible to make it so that a keyword can be associated with multiple metawords; care would have to be taken with this, but at the same time it might yield interesting results.
  • Metasearch: When a metasearch is executed, all the results for keywords belonging to the metaword of the keywords in the search will be returned. If we allow multiple metawords per keyword, it could be recursive to a specified depth as well (i.e. search all the metawords of the resulting keywords of the first metawords, etc.).
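
The QAB definition above can be sketched roughly as follows. This is only an illustration, assuming QA serial numbers are consecutive integers and that the QAs matching a search-word are already known; the names `QabBuilder` and `buildBlocks` are hypothetical and not part of the design.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups matching QA serial numbers into QABs.
class QabBuilder {

    // Returns {startQa, endQa} pairs, one per run of consecutive serials.
    // Assumes sortedQaIds is ascending and contains no duplicates.
    static List<int[]> buildBlocks(List<Integer> sortedQaIds) {
        List<int[]> blocks = new ArrayList<>();
        if (sortedQaIds.isEmpty()) {
            return blocks;
        }
        int start = sortedQaIds.get(0);
        int prev = start;
        for (int i = 1; i < sortedQaIds.size(); i++) {
            int id = sortedQaIds.get(i);
            if (id != prev + 1) {
                // Gap in the serial numbers: close the current block.
                blocks.add(new int[] {start, prev});
                start = id;
            }
            prev = id;
        }
        blocks.add(new int[] {start, prev});
        return blocks;
    }
}
```

In this sketch a lone matching QA becomes a block whose start and end are the same, and session boundaries are ignored; a real version would probably also split blocks wherever a session ends.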
Application Features

These are what I would consider the basic features; more powerful tools that build on these, such as graphical displays of data, can always be added later (I'm a big fan of both simplicity and extensibility ;) )

  • Load one or more formatted transcripts into a database from text, automatically associating existing keywords with new QAs.
  • Add keywords that associate with QAs where the word exists in the body of the QA text.
  • Create relationships between keywords allowing for convenient and organized expanded searches, aka metawords.
  • Display a list of results based on one or more search-words.
  • Options for results by Session, QA or QAB.
  • Options to exclude keywords if they appear in the text and/or from meta-search.
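
As a rough illustration of the first feature (loading a formatted transcript), here is a minimal sketch of splitting text into QAs. It assumes a formatting convention that is not specified above: each question starts a line with "Q:" and each answer starts a line with "A:", with other lines treated as continuations. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser; the "Q:" / "A:" line prefixes are an assumed format.
class TranscriptParser {

    // One question-and-answer unit; serial numbers would be assigned on load.
    static class Qa {
        final String question;
        final String answer;

        Qa(String question, String answer) {
            this.question = question;
            this.answer = answer;
        }
    }

    static List<Qa> parse(String transcript) {
        List<Qa> result = new ArrayList<>();
        StringBuilder q = null;
        StringBuilder a = null;
        for (String raw : transcript.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("Q:")) {
                if (a != null) {
                    // The previous pair is complete, so store it.
                    result.add(new Qa(q.toString(), a.toString()));
                    q = null;
                    a = null;
                }
                q = append(q, line.substring(2));
            } else if (line.startsWith("A:")) {
                if (q != null) {
                    a = append(a, line.substring(2));
                }
                // An answer with no preceding question is skipped.
            } else if (a != null) {
                append(a, line); // continuation of the answer
            } else if (q != null) {
                append(q, line); // continuation of the question
            }
        }
        if (a != null) {
            result.add(new Qa(q.toString(), a.toString()));
        }
        return result;
    }

    private static StringBuilder append(StringBuilder sb, String text) {
        String t = text.trim();
        return sb == null ? new StringBuilder(t) : sb.append(' ').append(t);
    }
}
```

Several "A:" lines in a row are merged into one answer here, which is only one possible reading of "a typical question and answer pair"; comments and notes inside a transcript would need their own rule.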

Data Model

This is what I have so far for the database tables. I've tried to keep repetition of data to a minimum and, as far as I can tell, this is all that is needed to implement the features listed above.

Code:
QA Table
------------------
| QA* | QA text* | 
------------------

Session Table
-----------------------------------------
| Session (date)* | first QA* | last QA* |
-----------------------------------------

Search Table
-----------------------------
| QA | keyword*? | metaword | 
-----------------------------

QAB Table
----------------------------------
| searchword | start QA | end QA |
----------------------------------

* = unique per row
*? = maybe unique per row
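
One thing the Session table makes possible is finding the session a QA belongs to from its serial number alone, since each session stores only its first and last QA. A minimal in-memory sketch of that lookup, assuming sessions never overlap and that the date is held as a plain string (both assumptions, as are the names `SessionIndex` and `sessionOf`):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory mirror of the Session table.
class SessionIndex {

    static class Session {
        final String date;   // session identifier; the string format is an assumption
        final int firstQa;
        final int lastQa;

        Session(String date, int firstQa, int lastQa) {
            this.date = date;
            this.firstQa = firstQa;
            this.lastQa = lastQa;
        }
    }

    // Sessions keyed by their first QA serial, kept in ascending order.
    private final TreeMap<Integer, Session> byFirstQa = new TreeMap<>();

    void add(String date, int firstQa, int lastQa) {
        byFirstQa.put(firstQa, new Session(date, firstQa, lastQa));
    }

    // Returns the session containing the given QA serial, or null if none does.
    Session sessionOf(int qa) {
        Map.Entry<Integer, Session> entry = byFirstQa.floorEntry(qa);
        if (entry == null || qa > entry.getValue().lastQa) {
            return null;
        }
        return entry.getValue();
    }
}
```

In a database the same lookup would be a range condition on the first and last QA columns; the point is only that no per-QA session column is needed.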

This is still very much a proto-design, so suggestions and ideas are welcome, though I think it would be best to keep it to details surrounding the core features and data model for now (not to say that suggestions for more core features aren't welcome :) ). GUI designs, on the other hand, are open game. Also I realize the name could be a little more creative too..

For anyone that is interested in working on this, my personally preferred language is Java with the NetBeans IDE (though if anyone knows a better one for WYSIWYG GUIs and integration, let me know!), simply because it'll run on any platform and it's pretty easy to code with. As for the database, ideally it would be hosted on the Cass server, if this is ok with those in charge of that, so whatever DB software is being used there (I would guess (hope :halo:) MySQL) would have to do. I can do simple ER design, but I'm really not proficient at all when it comes to writing SQL queries (disclaimer). I haven't gone much beyond what I've written here besides some thought on a few essential algorithms, most of which are quite implicit, as some of you may see from the data model. If there is enough interest and support for this, I can get to work on the application's architecture and class structure, which is my favorite part and kinda my specialty.. osit :halo:

And lastly.. please feel free to ask me to clarify anything that isn't making sense.
 
This is similar to what I did in SQL Server. My approach was strictly a proof of concept (and it didn't "prove" very well, unfortunately), and the data model wasn't very good because I didn't go through a modeling process. I could extract an ER diagram, but I don't think that would be very helpful in itself. Some high level modeling work, done in the forum or live (I'm not sure how adaptable the process would be to forums), could be very helpful. The first thing I would do, however, is create a vision document. (There are other development approaches that can work just as well -- this happens to be what I do.)

The advantages of SQL Server were that it provides term extraction and full text search capabilities and, for me, it is what I work with every day -- that was the main reason I used it. I have written some very "hairy queries" in my career and I can tell you that that is not the answer to anything except perhaps in application development where it is sometimes needed and nothing else will do. I work with government databases that are not designed to work with other databases, and it gets very hairy, but that shouldn't be an issue here. You don't want to have to sit down and write complex SQL queries just to do a search.

I can see some potential major issues with trying to host and maintain an application on the cass servers. It's not like a new transcript comes out every day. You can download the transcripts to a suitable server, parse, index, and format them, and upload them to wherever, for direct access by search and visualization tools, or to a graph server (which might be faster), and/or to a DB server with good full-text search capabilities (provided by a hosting service). This is a bit too far afield from what I do for a living, however, for me to be making specific software recommendations.
 
Saieden said:
For anyone that is interested in working on this, my personally preferred language is Java with the NetBeans IDE (though if anyone knows a better one for WYSIWYG GUIs and integration, let me know!), simply because it'll run on any platform and it's pretty easy to code with.

I am a newbie web developer who has worked on exactly one web application in his life, and that only in the last three months. I used the Vaadin framework (https://vaadin.com/home) to develop the UI for the application. All the code is written in Java using the Eclipse IDE. Vaadin provides a good set of built-in UI widgets and tools, so you don't have to design your own. It implements AJAX and is primarily used for developing server-side applications. I'll be glad to pitch in and help with the development; even though I'm pretty inexperienced right now, I'll be glad to learn and pick it up as we go. You can check out some of the demos available on Vaadin's website: https://vaadin.com/demo if you are interested. Also, although I'm not exactly sure how to use this feature, Vaadin provides a SQLContainer object for working with databases: https://vaadin.com/book/vaadin7/-/page/sqlcontainer.html. I've found the Vaadin framework to be easy enough to understand (for a newbie developer like myself), because of all the powerful built-in features and also because all of the code is written in Java. I think a user interface can also be developed pretty quickly using this framework.

Like I said, I'd love to help, but I also need to learn a lot first, which I'll gladly do. I haven't really worked much with algorithms, data structures or SQL either. From what I could understand from looking at your design doc, I think the major work will involve deciding how to break up a session into smaller QAs and QABs and subsequently linking them with the searchwords and most expansive metawords using an algorithm. I'll have to play with your design in my head a bit more and it'll be good to have a class structure diagram of the application logic for that. I am excited to see how this works out. :thup:
 
Megan said:
This is similar to what I did in SQL Server. My approach was strictly a proof of concept (and it didn't "prove" very well, unfortunately), and the data model wasn't very good because I didn't go through a modeling process. I could extract an ER diagram, but I don't think that would be very helpful in itself. Some high level modeling work, done in the forum or live (I'm not sure how adaptable the process would be to forums), could be very helpful.

There could be some use; it would be a shame not to at least try, though maybe only at a later stage.

The first thing I would do, however, is create a vision document. (There are other development approaches that can work just as well -- this happens to be what I do.)

Agreed, that'll help a lot to give others an idea of where and how they might be able to contribute and what they can (hopefully) look forward to :)

The advantages of SQL Server were that it provides term extraction and full text search capabilities and, for me, it is what I work with every day -- that was the main reason I used it. I have written some very "hairy queries" in my career and I can tell you that that is not the answer to anything except perhaps in application development where it is sometimes needed and nothing else will do. I work with government databases that are not designed to work with other databases, and it gets very hairy, but that shouldn't be an issue here. You don't want to have to sit down and write complex SQL queries just to do a search.

Do you have experience with the Express version? The way I envision this is that the database (and queries) is kept as simple as possible, so that working on the project, as well as experimenting with personal customization, is accessible to more developers on the forum, with the bulk of the data manipulation being done in the client application. Any developer should be able to set up a local database relatively quickly and easily, load the source text (or a subset thereof) through the app, and immediately have their own sandbox to play with features and extensions (for example).

I can see some potential major issues with trying to host and maintain an application on the cass servers. It's not like a new transcript comes out every day. You can download the transcripts to a suitable server, parse, index, and format them, and upload them to wherever, for direct access by search and visualization tools, or to a graph server (which might be faster), and/or to a DB server with good full-text search capabilities (provided by a hosting service). This is a bit too far afield from what I do for a living, however, for me to be making specific software recommendations.

Only the database itself would be hosted on the server. I would like to avoid as many complicating factors as possible, such as browser compatibility, which is also the main reason for wanting to do it in Java.

chrismcdude said:
I am a newbie web developer who has worked on exactly one web application in his life, and that only in the last three months. I used the Vaadin framework (https://vaadin.com/home) to develop the UI for the application. All the code is written in Java using the Eclipse IDE. Vaadin provides a good set of built-in UI widgets and tools, so you don't have to design your own. It implements AJAX and is primarily used for developing server-side applications. I'll be glad to pitch in and help with the development; even though I'm pretty inexperienced right now, I'll be glad to learn and pick it up as we go. You can check out some of the demos available on Vaadin's website: https://vaadin.com/demo if you are interested. Also, although I'm not exactly sure how to use this feature, Vaadin provides a SQLContainer object for working with databases: https://vaadin.com/book/vaadin7/-/page/sqlcontainer.html. I've found the Vaadin framework to be easy enough to understand (for a newbie developer like myself), because of all the powerful built-in features and also because all of the code is written in Java. I think a user interface can also be developed pretty quickly using this framework.

The problem with Vaadin, however, is that being web-based, it would mean we'd have to run a server for both the database and HTML (Apache) in our development environments, and it could clutter some of the code. Also, from what I can see so far, using Vaadin means it would be a "Vaadin Application", which suggests the core logic, GUI and data model would be more tightly coupled, and that would likely limit flexibility, especially in the architecture.

Like I said, I'd love to help, but I also need to learn a lot first, which I'll gladly do. I haven't really worked much with algorithms, data structures or SQL either. From what I could understand from looking at your design doc, I think the major work will involve deciding how to break up a session into smaller QAs and QABs and subsequently linking them with the searchwords and most expansive metawords using an algorithm. I'll have to play with your design in my head a bit more and it'll be good to have a class structure diagram of the application logic for that. I am excited to see how this works out. :thup:

Thank you for the support :) I would be happy to help you learn along the way. I was quite fortunate to have had a very broad scope of exposure during varsity, although I'm not particularly "experienced" by most standards, so it would help me too, both in grounding my computer science knowledge and, more importantly imo, in learning to teach in general.

With the data model I've laid out, you wouldn't really need any overall class structure to implement the basic features at all; you could do it all quite simply with just a few arrays and maybe one or two object classes to group some of the data together. The structure also implies that the gist of some of the algorithms, especially how the sessions are broken up, is implicitly defined, though of course there are near infinite ways they could be implemented. How else would I have gotten to the model I made without "processing" the general session structure in the first place? It could be a useful exercise to try and figure out for yourself how I might've come about it, but don't hesitate to ask me any questions if you're stuck :)
 
Saieden said:
...Do you have experience with the Express version? The way I envision this is that the database (and queries) is kept as simple as possible, so that working on the project, as well as experimenting with personal customization, is accessible to more developers on the forum, with the bulk of the data manipulation being done in the client application. Any developer should be able to set up a local database relatively quickly and easily, load the source text (or a subset thereof) through the app, and immediately have their own sandbox to play with features and extensions (for example).

I do all my development using SQL Server 2012 Developer Edition (DE), which costs about $50 or so retail (I obtain it through an MSDN subscription that gives me just about everything MS has, and my organization pays for it). It is a full Enterprise Edition (all enterprise features enabled) version of SQL Server but cannot, obviously, be used for production servers. It wouldn't make any sense to me to save $50 and use SQL Server Express (or whatever they call it these days), so I use that only when it comes bundled with something and I can't avoid it.

I think DE is probably fine to use for offline work and, as I mentioned before, transcript extraction need not be a real-time online process. Querying and visualizing are a different matter, and I suspect that a specialized server to which the processed data was uploaded might be the best approach for that (I certainly can't say for sure, with the little experience I have in that area). The only issue I am aware of there is that running specialized servers can violate a hosting service's terms of service. They are touchy about that for obvious reasons, but I think arrangements could be made.

If, by the way, this were a commercial project then I would not use DE for anything other than development. The full EE version, however, could be licensed by FOTCM at very low cost, not that it matters. As you can see, I am not eager to use "free" tools and servers on the back end, when substantial ETL processing is involved. I think they are fine for a simple website back end, however.

The problem with Vaadin, however, is that being web-based, it would mean we'd have to run a server for both the database and HTML (Apache) in our development environments, and it could clutter some of the code. Also, from what I can see so far, using Vaadin means it would be a "Vaadin Application", which suggests the core logic, GUI and data model would be more tightly coupled, and that would likely limit flexibility, especially in the architecture.

For PHP/Apache/MySQL development (which I do in some of my volunteer work) I use XAMPP on my Mac. It couldn't be simpler to install. If I developed a mod for SMF, that is how I would do it.
 
Saieden said:
The way I envision this is that the database (and queries) is kept as simple as possible, so that working on the project, as well as experimenting with personal customization, is accessible to more developers on the forum, with the bulk of the data manipulation being done in the client application. Any developer should be able to set up a local database relatively quickly and easily, load the source text (or a subset thereof) through the app, and immediately have their own sandbox to play with features and extensions (for example).

Only the database itself would be hosted on the server. I would like to avoid as many complicating factors as possible, such as browser compatibility, which is also the main reason for wanting to do it in Java.

So I'm assuming you're looking for the application to run on the client-side rather than the server-side based on your replies to Megan. Not having much knowledge of client-side development, I would now wait for further ideas to develop first, in order to improve my understanding of how the application UI and architecture is going to be designed.

The problem with Vaadin, however, is that being web-based, it would mean we'd have to run a server for both the database and HTML (Apache) in our development environments, and it could clutter some of the code. Also, from what I can see so far, using Vaadin means it would be a "Vaadin Application", which suggests the core logic, GUI and data model would be more tightly coupled, and that would likely limit flexibility, especially in the architecture.

Okay, that is true. So I guess I'll have to wait and see what approach you and other experienced developers are going to follow.

With the data model I've laid out, you wouldn't really need any overall class structure to implement the basic features at all; you could do it all quite simply with just a few arrays and maybe one or two object classes to group some of the data together. The structure also implies that the gist of some of the algorithms, especially how the sessions are broken up, is implicitly defined, though of course there are near infinite ways they could be implemented. How else would I have gotten to the model I made without "processing" the general session structure in the first place? It could be a useful exercise to try and figure out for yourself how I might've come about it, but don't hesitate to ask me any questions if you're stuck :)

I will admit here that I feel pretty dumb now that I didn't understand the concept of a QA earlier. My idea of a QA was closer to what you are saying when you talk about a QAB as a sequence of QAs from a specific transcript.

Please correct me if I'm wrong, but when you break the transcript up into question-answer pairs, each of these pairs is one QA, right? The question, and the answer that follows immediately after it, forms one QA. And a QAB will be a particular sequence of consecutive QAs in a session, from a start QA to end QA, where both of these two QAs contain a particular searchword. Now it makes a lot of sense. :-[

So how are you going to decide which keyword to map to a particular QA initially? If this is done manually, it will probably take a lot of time. Or are you going to define a list of keywords in the beginning which will be then used to do this QA-keyword mapping based on whether a QA contains that keyword? Also, one QA can contain multiple keywords, right? In that case, the search table cannot have a unique keyword per row.

I also wanted to discuss the idea of using metawords a little bit. Wouldn't a metaword also have to be a keyword in most cases? What will be the corresponding metaword for such a metaword as one which is also a keyword? And so on, by recursion this will never stop. If you say that the metaword of this metaword, is going to be the metaword itself (I'm sorry :P), then you will end up dividing the space of keywords into discrete groups, each characterized by a single metaword, and this may not be realistic. Therefore using metawords to relate keywords, might not be a good idea osit. But then again, I might have completely missed some important point here. :huh:
 
Megan said:
I do all my development using SQL Server 2012 Developer Edition (DE), which costs about $50 or so retail (I obtain it through an MSDN subscription that gives me just about everything MS has, and my organization pays for it). It is a full Enterprise Edition (all enterprise features enabled) version of SQL Server but cannot, obviously, be used for production servers. It wouldn't make any sense to me to save $50 and use SQL Server Express (or whatever they call it these days), so I use that only when it comes bundled with something and I can't avoid it.

I think DE is probably fine to use for offline work and, as I mentioned before, transcript extraction need not be a real-time online process. Querying and visualizing are a different matter, and I suspect that a specialized server to which the processed data was uploaded might be the best approach for that (I certainly can't say for sure, with the little experience I have in that area). The only issue I am aware of there is that running specialized servers can violate a hosting service's terms of service. They are touchy about that for obvious reasons, but I think arrangements could be made.

If, by the way, this were a commercial project then I would not use DE for anything other than development. The full EE version, however, could be licensed by FOTCM at very low cost, not that it matters. As you can see, I am not eager to use "free" tools and servers on the back end, when substantial ETL processing is involved. I think they are fine for a simple website back end, however.

There wouldn't be much ETL processing involved, I don't think, since the data would mostly be processed manually and locally by the application which would simplify the load queries. For loading a transcript, for example, the manual part would just be copy-paste into a text and then some minor formatting. I think a basic MySQL database should be enough for that.

For PHP/Apache/MySQL development (which I do in some of my volunteer work) I use XAMPP on my Mac. It couldn't be simpler to install. If I developed a mod for SMF, that is how I would do it.

If it becomes web-based, then yes, that is probably the best option.

chrismcdude said:
So I'm assuming you're looking for the application to run on the client-side rather than the server-side based on your replies to Megan. Not having much knowledge of client-side development, I would now wait for further ideas to develop first, in order to improve my understanding of how the application UI and architecture is going to be designed.

Yes, that is my plan. I'm going to do my best to keep it modular, so it'll be some kind of model-view-controller architecture in all likelihood.

Okay, that is true. So I guess I'll have to wait and see what approach you and other experienced developers are going to follow.

We have a public holiday on Monday, so I'm hoping to get the vision document and a good amount of high-level design done over the long weekend.

I will admit here that I feel pretty dumb now that I didn't understand the concept of a QA earlier. My idea of a QA was closer to what you are saying when you talk about a QAB as a sequence of QAs from a specific transcript.

Try not to worry, speedy intuition takes a while to develop, not to mention that hindsight is always 20/20 :)

Please correct me if I'm wrong, but when you break the transcript up into question-answer pairs, each of these pairs is one QA, right? The question, and the answer that follows immediately after it, forms one QA. And a QAB will be a particular sequence of consecutive QAs in a session, from a start QA to end QA, where both of these two QAs contain a particular searchword. Now it makes a lot of sense. :-[

Exactly right.

So how are you going to decide which keyword to map to a particular QA initially? If this is done manually, it will probably take a lot of time. Or are you going to define a list of keywords in the beginning which will be then used to do this QA-keyword mapping based on whether a QA contains that keyword?

When you add a new keyword, whether it's a list or just one, all the QAs would be searched and added to the search table. The main purpose of manually adding keywords to QAs would be to make "complete" QABs where some of the in between QAs don't contain the keyword (or metaword, for that matter).
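
The step of adding a new keyword and searching all the QAs for it could look something like the sketch below. It assumes the QA texts are available as a map from serial number to text, and it matches case-insensitively on whole words, which is my own choice rather than anything decided above. `KeywordIndexer` and `findMatches` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical helper: finds the QAs a new keyword should be associated with.
class KeywordIndexer {

    // qaTexts maps QA serial -> QA text.
    // Returns, in ascending order, the serials whose text contains the keyword.
    static List<Integer> findMatches(Map<Integer, String> qaTexts, String keyword) {
        // Whole-word, case-insensitive matching is an assumption of this sketch.
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(keyword) + "\\b", Pattern.CASE_INSENSITIVE);
        List<Integer> hits = new ArrayList<>();
        // Copy into a TreeMap so the serials are visited in ascending order.
        for (Map.Entry<Integer, String> e : new TreeMap<Integer, String>(qaTexts).entrySet()) {
            if (pattern.matcher(e.getValue()).find()) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

Each returned serial would become one row in the search table for that keyword. Manually added keywords for the "in between" QAs would simply be extra rows that this automatic pass cannot find.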

Also, one QA can contain multiple keywords, right? In that case, the search table cannot have a unique keyword per row.

I overlooked that, thanks :) The "unique keyword" was supposed to be for having a keyword only mapped to one metaword. Having thought more about it, I think that would be too limiting, as you might want to have types of metawords, such as conceptual links and synonyms.

I also wanted to discuss the idea of using metawords a little bit. Wouldn't a metaword also have to be a keyword in most cases? What will be the corresponding metaword for such a metaword as one which is also a keyword? And so on, by recursion this will never stop.

The metaword is the group itself that one or more keywords belong to. Only in the case that a keyword has no relationship to any other keyword, would a keyword "be a metaword", if that makes sense. So [density, realm, dimension] would be a metaword and [density, 1D, 2D, 3D, 4D, 5D, 6D, 7D] would be a totally different metaword. If you metasearch density, you would get back all the results from words in both sets, but realm would only give you the first set's results. If it was made recursive, then the iterative depth would be configurable for the sake of sanity. Also, there would be exclusions, so you could search realm recursively, exclude density, and you would get none of the second set but still results from other metawords containing dimension.
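
To make the metasearch behaviour concrete, here is a minimal sketch of the expansion step, assuming each metaword is simply a set of keywords. It treats an excluded keyword as one that is neither returned nor expanded further, which is how I read the exclusion example; the names `MetaSearch` and `expand` are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of metaword expansion with depth and exclusions.
class MetaSearch {

    // Expands a keyword through the metawords it belongs to.
    // depth 1 = only the metawords containing the keyword itself;
    // depth 2 = also the metawords of the keywords found at depth 1, and so on.
    static Set<String> expand(List<Set<String>> metawords, String keyword,
                              int depth, Set<String> excluded) {
        Set<String> found = new LinkedHashSet<>();
        if (excluded.contains(keyword)) {
            return found;
        }
        found.add(keyword);
        Set<String> frontier = new LinkedHashSet<>(found);
        for (int level = 0; level < depth && !frontier.isEmpty(); level++) {
            Set<String> next = new LinkedHashSet<>();
            for (Set<String> group : metawords) {
                if (Collections.disjoint(group, frontier)) {
                    continue; // this metaword shares no keyword with the frontier
                }
                for (String k : group) {
                    // Excluded keywords are skipped, so they are never expanded.
                    if (!excluded.contains(k) && found.add(k)) {
                        next.add(k);
                    }
                }
            }
            frontier = next;
        }
        return found;
    }
}
```

With the two sets above, expanding realm to depth 1 gives only the first set, expanding density gives both, and expanding realm to depth 2 with density excluded leaves realm and dimension. The resulting keywords would then be looked up in the search table to get the actual QAs.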

If you say that the metaword of this metaword, is going to be the metaword itself (I'm sorry :P), then you will end up dividing the space of keywords into discrete groups, each characterized by a single metaword, and this may not be realistic. Therefore using metawords to relate keywords, might not be a good idea osit. But then again, I might have completely missed some important point here. :huh:

Yeah, you're right, this is what I realized would be the result of having a keyword only relate to a single metaword; it would partition the searches and wouldn't have much room to "play" with, which is kinda the point of doing all this in the first place ;D
 
Dear Laura, dear forum. I was looking for a way to extract the information contained in the session transcripts made by Laura. So, I took each transcript, copied the answers provided by the C's, and sorted these alphabetically. However, I neglected to provide the date of each answer. Maybe somebody else could redo this work and add the session date. I attach my file for your information and use.
 
