Tool to search in transcripts

Apei

So I'm currently building a tool (you could call it a search engine) in my free time to search the transcripts, pretty much like the one on the Ra website. I'm doing this because it seemed like a good idea and because I didn't find any equivalent for the Cassiopaean transcripts. I hope I'm not redoing something that already exists, as I know there is already a search option on this forum.

It can serve multiple purposes: counting words and letters, searching for sentences or words separately in questions and answers, and comparing against other materials if you add them to it. There is nothing fancy about it, as it's only one database on my computer (it is local and not shared), and it is mainly built in PHP and jQuery.
I will gladly provide information on it and share it if this is allowed here (I can create a git repo, but I'm not a big fan of it).
I can improve it too if you have ideas!
 
Sounds like your search might offer a few helpful features, so please elaborate how your search would be better than just doing a forum search of that board which contains the transcripts? Thanks
 
JGeropoulas said:
Sounds like your search might offer a few helpful features, so please elaborate how your search would be better than just doing a forum search of that board which contains the transcripts? Thanks

Apei, do you mean something like this: http://www.lawofone.info/search.php

FWIW, I do not know anything about the nuts and bolts of running/programming a website such as this forum; I do know, though, that it's a lot of work. There may be reasons why it can't work here, Apei - seem to recall a very recent discussion on this sort of thing (searches) and how it is easier to use a dedicated search engine than try and reconfigure the forum (perhaps pressure on the server/system load?) - will leave it to the more experienced to answer you.
 
I think this is a good idea on the surface, Apei. It would be a good idea to list what kind of improvements your search has over the forum search. Also, I'm not sure how you're loading the sessions into your local database, but if you're scraping the forum you'll probably want to give a heads-up to the admins here.

Potentially, you could overload the server if you're not careful about throttling your spiders. This is also one of many reasons why it's a good idea to share a git repo of the code here. Even if only private; more eyes are always better. One caveat though: if you're somehow accessing the private database that backs the forum (I doubt this), then you should send a PM to an admin.

Another point. The cyber world is on guard right now with WannaCry and other related worms going around so it would be considerate if you could share as many technical details as possible.

apei said:
I can improve it too if you have ideas !
One thing I've noticed personally is that it's easier to remember context from a session as opposed to remembering the exact quote. It would be nice if a search could take context and match that to specific sessions. I think one way that could be done is if NLTK (Natural Language Toolkit) or NLP (natural language processing) in general were used during the search.

http://blog.algorithmia.com/introduction-natural-language-processing-nlp/
NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Here is a library that could help facilitate the above idea. Unfortunately, it doesn't use PHP :(
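
As a rough illustration of the context-matching idea, here is a minimal sketch in plain Python. It does not use NLTK; it swaps in a simple bag-of-words cosine similarity instead. The session snippets, the function names (`tokenize`, `cosine`, `rank_sessions`) and the tiny stop-word list are all invented for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real one would be much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "about", "what", "was"}


def tokenize(text):
    """Lowercase the text and keep words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sessions(query, sessions):
    """Return (session_id, score) pairs sorted from best to worst match."""
    q = Counter(tokenize(query))
    scores = [(sid, cosine(q, Counter(tokenize(text)))) for sid, text in sessions.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical session snippets, invented for the example.
SESSIONS = {
    "session-a": "Q: What causes the weather changes? A: Climate is influenced by cosmic factors.",
    "session-b": "Q: Tell us about diet and health. A: Diet affects the body.",
}

if __name__ == "__main__":
    # A vaguely remembered context instead of an exact quote.
    print(rank_sessions("something about diet and the body", SESSIONS))
```

This only matches on shared words, so it would miss synonyms; a proper NLP library would be needed for anything smarter.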
 
voyageur said:
JGeropoulas said:
Sounds like your search might offer a few helpful features, so please elaborate how your search would be better than just doing a forum search of that board which contains the transcripts? Thanks

Apei, do you mean something like this: http://www.lawofone.info/search.php

FWIW, I do not know anything about the nuts and bolts of running/programming a website such as this forum; I do know, though, that it's a lot of work. There may be reasons why it can't work here, Apei - seem to recall a very recent discussion on this sort of thing (searches) and how it is easier to use a dedicated search engine than try and reconfigure the forum (perhaps pressure on the server/system load?) - will leave it to the more experienced to answer you.

Okay, I haven't looked into the details of the search option on this forum, so maybe I'm not fully aware of what it can do.
I believe you cannot search the transcripts specifically; it only permits a broad search of the whole forum or of a single topic.

The tool provides a more efficient way to search the transcripts only:


- Any search can be run on one session, on specific sessions (chosen from a list) or on all sessions
- You can search for a specific word; e.g.: I want to find all the occurrences of the word "foo", and only in the answers. You type "foo" in the search bar, tick the "search only in answers" checkbox, and it returns the results.

- You can search for a full sentence
- You can search for multiple words: specify "foo bar" and it will search for both
- You can add an exclusion such as 'but does not contain this "word" in the session|the questions|the answers', for example
- All searches can optionally be case sensitive
- You can search for special characters such as quotes

- For those interested in numbers and statistics: instead of searching, you can choose to count the occurrences of a word, a letter or even a sentence in a session|specific sessions|all sessions
- You can count all the letters in a session|specific sessions|all sessions, and choose whether to count only the Q, only the A or both (I might also add an option to omit the space character)
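
To make the options above concrete, here is a minimal sketch in Python (the tool itself is written in PHP, so this is not its actual code). The `Line` record, the function names and the sample data are assumptions made up for illustration. Matching is by plain substring, so "foo" would also match "food", and the exclusion is applied per line rather than per session to keep the example short.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    session: str   # session identifier
    speaker: str   # "Q" for a question, "A" for an answer
    text: str


def search(lines, words, speaker=None, sessions=None, exclude=None, case_sensitive=False):
    """Return the lines containing every word in `words`, honouring the filters."""
    flags = 0 if case_sensitive else re.IGNORECASE

    def has(word, text):
        # re.escape lets special characters such as quotes be matched literally.
        return re.search(re.escape(word), text, flags) is not None

    hits = []
    for line in lines:
        if speaker and line.speaker != speaker:
            continue
        if sessions and line.session not in sessions:
            continue
        if exclude and has(exclude, line.text):
            continue
        if all(has(w, line.text) for w in words):
            hits.append(line)
    return hits


def count_occurrences(lines, needle, speaker=None, case_sensitive=False):
    """Count non-overlapping occurrences of a word, letter or sentence."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return sum(
        len(re.findall(re.escape(needle), line.text, flags))
        for line in lines
        if speaker is None or line.speaker == speaker
    )


def count_letters(lines, speaker=None, skip_spaces=False):
    """Count characters, optionally ignoring the space character."""
    total = 0
    for line in lines:
        if speaker is None or line.speaker == speaker:
            text = line.text.replace(" ", "") if skip_spaces else line.text
            total += len(text)
    return total


# Made-up sample data.
SAMPLE = [
    Line("s1", "Q", "Is foo related to bar?"),
    Line("s1", "A", "Foo is close to bar."),
    Line("s2", "A", "No foo here, only baz."),
]

if __name__ == "__main__":
    for hit in search(SAMPLE, ["foo"], speaker="A"):
        print(hit.session, hit.text)
```

For instance, `search(SAMPLE, ["foo", "bar"])` looks for lines containing both words, and `count_occurrences(SAMPLE, "foo", speaker="A")` counts "foo" in the answers only.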


I would like to think that mine is a little more user-friendly, but it is indeed pretty much like http://www.lawofone.info/search.php (which doesn't have the "count" option).


I'm not planning to alter or disturb the forum or to replace the forum's search option. I know this option is in most cases a native part of the forum software.

It would be a better choice to host the tool on an external website, and this is why I've created the topic here: to discuss it before doing anything.
I created it because I had a use for it (and had fun creating it): I wanted to search for specific "raw data", if I may call it that, and compare it with other material I had stumbled upon.
Unfortunately I didn't have the time to re-read all the transcripts, so I built a means to search them. It was useful to me, so why not to other members? This is another reason why I'm posting here :)

trendsetter37 said:
Potentially, you could overload the server if you're not careful about throttling your spiders. This is also one of many reasons why it's a good idea to share a git repo of the code here. Even if only private; more eyes are always better. One caveat though: if you're somehow accessing the private database that backs the forum (I doubt this), then you should send a PM to an admin.

Another point. The cyber world is on guard right now with WannaCry and other related worms going around so it would be considerate if you could share as many technical details as possible.

apei said:
I can improve it too if you have ideas !
One thing I've noticed personally is that it's easier to remember context from a session as opposed to remembering the exact quote. It would be nice if a search could take context and match that to specific sessions. I think one way that could be done is if NLTK (Natural Language Toolkit) or NLP (natural language processing) in general were used during the search.

http://blog.algorithmia.com/introduction-natural-language-processing-nlp/

I think I'll create a git repo, but I'm not familiar with it so I might need help.
About retrieving the data, I was doing it this way: a loop to curl all the sessions at once (here on the pleyades site, because the URLs are easier to predict)

e.g.: curl -s http://www.bibliotecapleyades.net/vida_alien/sess_cass/cass01.htm | tidy -n -asxml 2>/dev/null | xmlstarlet pyx > "cass01.txt"
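
A minimal sketch of such a loop, written in Python rather than shell, with a pause between requests so the server is not hammered. The two-digit zero padding is inferred from the single `cass01.htm` example, and the session range, the 5-second delay and the function names are assumptions for illustration only.

```python
import time
import urllib.request

# URL pattern taken from the example above; the {:02d} padding is an assumption.
BASE = "http://www.bibliotecapleyades.net/vida_alien/sess_cass/cass{:02d}.htm"


def session_urls(first, last):
    """Build the predictable session URLs: cass01.htm, cass02.htm, ..."""
    return [BASE.format(n) for n in range(first, last + 1)]


def fetch_all(urls, delay_seconds=5.0, fetch=None, sleep=time.sleep):
    """Download each URL in turn, pausing between requests to spare the server."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read()
    pages = {}
    for i, url in enumerate(urls):
        if i:
            sleep(delay_seconds)  # throttle: never send requests back to back
        pages[url] = fetch(url)
    return pages


if __name__ == "__main__":
    # Hypothetical range; the real number of session pages is not known here.
    for url in session_urls(1, 3):
        print(url)
```

The `fetch` and `sleep` parameters exist only so the loop can be tried out without touching the network.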

I can't do it here because the URLs are organized by topic. I believe there are other means to retrieve the data, but they could be detrimental to the site, so I didn't use them; instead I'm copy-pasting like a bot directly into the database.
After that, multiple scripts run automatically on my server to parse the data and fill other tables (these tables are used for the search).


So everything I've built is stored locally: I have a CentOS 7 machine (running on VMware) with MySQL and httpd. I'm copy-pasting the data into the database and parsing it with PHP functions (strpos, strlen, strrpos, substr, ...). The user fills in a search bar with options ($_POST) and submits, then the web server (httpd) queries the database with mysqli functions.
Some parts were done in a rush (PHP is not my forte), as this is just a small utility.
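
The parse, store and query flow can be sketched as follows. This is a Python stand-in, not the PHP code: `sqlite3` replaces MySQL/mysqli, and the table layout, the function names and the assumption that every turn starts with "Q:" or "A:" at the beginning of a line are all made up for illustration. The search term is passed as a bound parameter, which is the equivalent of a prepared statement and keeps form input out of the SQL text.

```python
import re
import sqlite3

# Assumption: each turn starts with "Q:" or "A:" at the beginning of a line.
TURN = re.compile(r"^(Q|A):\s*(.*)$")


def parse_transcript(raw):
    """Split raw transcript text into (speaker, text) pairs."""
    turns = []
    for line in raw.splitlines():
        line = line.strip()
        match = TURN.match(line)
        if match:
            turns.append([match.group(1), match.group(2)])
        elif line and turns:
            # Continuation line: append it to the current turn.
            turns[-1][1] += " " + line
    return [tuple(t) for t in turns]


def load(conn, session, raw):
    """Parse one session and insert its turns into the (hypothetical) table."""
    conn.execute("CREATE TABLE IF NOT EXISTS turns (session TEXT, speaker TEXT, text TEXT)")
    conn.executemany(
        "INSERT INTO turns VALUES (?, ?, ?)",
        [(session, speaker, text) for speaker, text in parse_transcript(raw)],
    )


def search(conn, term, speaker=None):
    """Substring search; SQLite's LIKE ignores ASCII case by default."""
    # Escape LIKE wildcards so the user's input is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    sql = "SELECT session, speaker, text FROM turns WHERE text LIKE ? ESCAPE '\\'"
    params = ["%" + escaped + "%"]
    if speaker:
        sql += " AND speaker = ?"
        params.append(speaker)
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(conn, "s1", "Q: Is foo real?\nA: Foo is real.\nVery real.")
    print(search(conn, "foo", speaker="A"))
```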
As for WannaCry, in my opinion this is kind of bogus, as most (I'm speaking of medium/big enterprises) have Veeam backup or something of the sort. But it is true that digital security is deplorable today, because efficiency/profit is prioritised over security. I guess this topic could fill another thread!

NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

This is highly interesting; unfortunately I'm not a developer, so I'm not really sure how I could implement it... Maybe someone could after I create the git repo. But I was thinking of associating keywords with sessions (it can be done automatically with a correspondence table) to add some context to them, and of adding a suggestion option if the user chooses to use it.
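
The correspondence-table idea could be sketched like this in Python. The topics, the trigger words and the function names are hypothetical; a real table would have to be built from the transcripts themselves.

```python
import re

# Hypothetical correspondence table: topic -> words that signal it.
TOPICS = {
    "health": {"diet", "health", "body"},
    "weather": {"climate", "weather", "storm"},
}


def tag_session(text, topics=TOPICS):
    """Return the sorted list of topics whose trigger words occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(topic for topic, triggers in topics.items() if words & triggers)


def suggest(query, tagged_sessions, topics=TOPICS):
    """Suggest the sessions that share at least one topic with the query."""
    wanted = set(tag_session(query, topics))
    return sorted(sid for sid, tags in tagged_sessions.items() if wanted & set(tags))


if __name__ == "__main__":
    # Tags would normally be computed once per session and stored in a table.
    tagged = {"s1": tag_session("Q: Is this diet good for the body?"),
              "s2": tag_session("A: The storm follows the climate shift.")}
    print(suggest("a question about health", tagged))
```

Tagging each session once and storing the result keeps the suggestion step cheap at search time.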
 
voyageur said:
There may be reasons why it can't work here, Apei - seem to recall a very recent discussion on this sort of thing (searches) and how it is easier to use a dedicated search engine than try and reconfigure the forum (perhaps pressure on the server/system load?) - will leave it to the more experienced to answer you.

Voyageur, said discussion is here: Search Function doesn't work
 
Apei said:
Okay, I haven't looked into the details of the search option on this forum, so maybe I'm not fully aware of what it can do.
If you click on the tab "Search" below the top banner, you get the advanced search window (vs. using the main search box on the right side of the top banner)
I believe you cannot search the transcripts specifically; it only permits a broad search of the whole forum or of a single topic.
You can go to Home/Esoterica/The Cassiopaean Experiment/Cassiopaean Sessions Transcripts, then do a search set for "search this topic" to get targeted search results. Beyond that, of course you're correct that you cannot limit a search to a specific transcript session. You can however open up an individual transcript session and use the "Find" function to comb through that session. As for anything more technical that's being discussed here, it's way over my head.
 