voyageur said:
JGeropoulas said:
Sounds like your search might offer a few helpful features, so please elaborate on how it would be better than just doing a forum search of the board that contains the transcripts. Thanks
Apei, do you mean something like this: http://www.lawofone.info/search.php
FWIW, I don't know anything about the nuts and bolts of running or programming a website such as this forum; I do know, though, that it's a lot of work. There may be reasons why it can't work here, Apei. I seem to recall a very recent discussion on this sort of thing (searches) and how it is easier to use a dedicated search engine than to try to reconfigure the forum (perhaps because of pressure on the server/system load?). I will leave it to the more experienced to answer you.
Okay, I haven't looked into the details of the search option on this forum, so maybe I'm not fully aware of what it can do.
I believe you cannot search the transcripts specifically; it only permits a broad search of the whole forum or of a single topic.
The tool provides a more efficient way to search in the transcripts only:
- Any search can be run on one session, on specific sessions (chosen from a list) or on all sessions
- You can search for a specific word; e.g. I want to find all the occurrences of the word "foo", and I want to look only in the answers.
You type "foo" in the search bar, tick the "search only in answers" checkbox, and it returns the results.
- You can search for a full sentence
- You can search for multiple words. Enter "foo bar" and it will search for them
- You can add an exclusion such as 'but does not contain this "word" in the session|the questions|the answers', for example
- Any search can be case sensitive (optional)
- You can search for special characters such as quotation marks
- For those who are interested in numbers and statistics: instead of searching, you can count the occurrences of a word, a letter or even a sentence in a session|specific sessions|all sessions
- You can count all the letters in a session|specific sessions|all sessions and choose whether to count only the Q, only the A, or both (I might also add an option to omit the space character)
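To make the options above concrete, here is a minimal sketch in Python of how such filters could be combined. It is not the actual tool (which is written in PHP); the `Entry` record, the function names and the sample data are all made up for illustration. It also assumes that a multi-word search means "every word must appear", which the list above does not spell out.

```python
import re
from dataclasses import dataclass


@dataclass
class Entry:
    """One question or answer from a transcript (hypothetical structure)."""
    session: int
    kind: str  # "Q" or "A"
    text: str


def _selected(entries, sessions=None, kind=None):
    """Yield the entries that match the session and Q/A filters."""
    for e in entries:
        if sessions is not None and e.session not in sessions:
            continue
        if kind is not None and e.kind != kind:
            continue
        yield e


def search(entries, terms, sessions=None, kind=None, exclude=(), case_sensitive=False):
    """Return entries containing every term and none of the excluded words.

    Matching is plain substring matching, so "foo" also matches "food".
    """
    def norm(s):
        return s if case_sensitive else s.lower()

    wanted = [norm(t) for t in terms]
    banned = [norm(t) for t in exclude]
    hits = []
    for e in _selected(entries, sessions, kind):
        text = norm(e.text)
        if all(t in text for t in wanted) and not any(b in text for b in banned):
            hits.append(e)
    return hits


def count_occurrences(entries, needle, sessions=None, kind=None, case_sensitive=False):
    """Count non-overlapping occurrences of a word, letter or sentence."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(re.escape(needle), flags)
    return sum(len(pattern.findall(e.text)) for e in _selected(entries, sessions, kind))


def count_characters(entries, sessions=None, kind=None, omit_spaces=False):
    """Count all characters, optionally leaving out the space character."""
    total = 0
    for e in _selected(entries, sessions, kind):
        text = e.text.replace(" ", "") if omit_spaces else e.text
        total += len(text)
    return total


# Made-up sample data, only to show the calls.
entries = [
    Entry(1, "Q", "Is foo related to bar?"),
    Entry(1, "A", 'Foo is "related" to bar.'),
    Entry(2, "A", "No foo here, only baz."),
]

answers_with_foo = search(entries, ["foo"], kind="A")
foo_without_baz = search(entries, ["foo"], kind="A", exclude=["baz"])
foo_total = count_occurrences(entries, "foo")
```

A search for "foo bar" would be passed as `["foo", "bar"]`, and a full sentence or a quote character as a single term.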
I would like to think that mine is a little more user-friendly but it is indeed pretty much like http://www.lawofone.info/search.php (without the "count" option).
Now, I'm not planning to alter or disturb the forum or to replace the forum's search option. I know this option is in most cases built into the forum software.
It would be a better choice to host the tool on an external website, but that is why I created this thread: to discuss it here before doing anything.
I created it because I had a use for it (and had fun while creating it). I wanted to search for specific "raw data", if I may call it that, and compare it with other data I had stumbled upon.
Unfortunately I didn't have the time to re-read all the transcripts, so I built a means to search them. It was useful to me, so why not to other members? This is another reason why I'm posting here :)
trendsetter37 said:
Potentially, you could overload the server if you're not careful about throttling your spiders. This is also one of many reasons why it's a good idea to share a git repo of the code here. Even if it's only private, more eyes are always better. One caveat, though: if you're somehow accessing the private database that backs the forum (I doubt this), then you should send a PM to an admin.
Another point: the cyber world is on guard right now with WannaCry and other related worms going around, so it would be considerate if you could share as many technical details as possible.
apei said:
I can improve it too if you have ideas!
One thing I've noticed personally is that it's easier to remember the context of a session than to remember the exact quote. It would be nice if a search could take context and match it to specific sessions. I think one way that could be done is by using NLP (natural language processing), for example with NLTK (the Natural Language Toolkit), during the search.
http://blog.algorithmia.com/introduction-natural-language-processing-nlp/
I think I'll create a git repo but I'm not familiar with it so I might need help.
About retrieving: I was doing it this way: create a loop to curl all the sessions at once (on the pleyades site, because the URLs are easier to predict)
eg: curl -s http://www.bibliotecapleyades.net/vida_alien/sess_cass/cass01.htm | tidy -n -asxml 2>/dev/null | xmlstarlet pyx > "cass01.txt"
I can't do it here because the URLs are organized by topic. I believe there are other ways to retrieve the data, but they could be detrimental to the site, so I didn't use them; instead I'm copying and pasting like a bot, directly into the database.
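For anyone who wants to reproduce the retrieval loop without hammering the remote server, here is a rough Python sketch with a pause between requests. Only the cass01.htm URL comes from this thread; the `cassNN.htm` numbering pattern, the 5-second delay and the function names are assumptions for illustration.

```python
import time
import urllib.request

# Assumed pattern: only cass01.htm is confirmed, the numbering is a guess.
BASE = "http://www.bibliotecapleyades.net/vida_alien/sess_cass/cass{:02d}.htm"


def session_urls(first, last):
    """Build the list of session URLs from first to last, inclusive."""
    return [BASE.format(n) for n in range(first, last + 1)]


def fetch(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_all(urls, delay_seconds=5.0, fetcher=fetch, sleep=time.sleep):
    """Fetch the pages one at a time, pausing between requests.

    fetcher and sleep are parameters so the loop can be tried offline.
    """
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # throttle: no pause is needed before the first request
        pages[url] = fetcher(url)
    return pages
```

Calling `fetch_all(session_urls(1, 3))` would make three requests with two pauses in between. A longer delay is kinder to the server; the right value depends on the site.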
After that, multiple scripts run automatically on my server to parse the data and fill other tables (these tables are used for the search).
So everything I've built is stored locally: I have CentOS 7 (running on VMware) with MySQL and httpd. I copy and paste the data into the database and parse it with PHP functions (strpos, strlen, strrpos, substr, ...). The user fills in a search bar with options ($_POST) and submits; the web server (httpd) then queries the database with mysqli functions.
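The parse-then-query pipeline described above can be sketched as follows. This is an illustration, not the real code: it uses Python and an in-memory SQLite database in place of PHP and MySQL, and it assumes a hypothetical transcript format where lines start with "Q:" or "A:". The table and column names are invented. The one point that carries over to mysqli is passing the user's input as a bound parameter instead of concatenating it into the SQL string.

```python
import sqlite3


def parse_transcript(session, raw):
    """Split raw text into (session, kind, text) rows.

    Assumes lines begin with "Q:" or "A:"; other non-empty lines are
    treated as a continuation of the previous row.
    """
    rows = []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            rows.append((session, "Q", line[2:].strip()))
        elif line.startswith("A:"):
            rows.append((session, "A", line[2:].strip()))
        elif line and rows:
            s, k, t = rows[-1]
            rows[-1] = (s, k, t + " " + line)
    return rows


def build_db(rows):
    """Create the search table in memory and fill it with the parsed rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE entries (session INTEGER, kind TEXT, text TEXT)")
    db.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
    return db


def search_db(db, word, kind=None):
    """Case-insensitive substring search with bound parameters.

    instr() is used instead of LIKE so that % and _ in the search text
    are matched literally, which matters when searching special characters.
    """
    sql = ("SELECT session, kind, text FROM entries "
           "WHERE instr(lower(text), lower(?)) > 0")
    params = [word]
    if kind is not None:
        sql += " AND kind = ?"
        params.append(kind)
    sql += " ORDER BY session"
    return db.execute(sql, params).fetchall()
```

For example, `search_db(build_db(parse_transcript(1, raw)), "foo", kind="A")` returns only the answers from session 1 that contain "foo".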
Some parts were done in a rush (PHP is not my forte), as this is just a small utility.
As for WannaCry, in my opinion this is kind of bogus, as most (I'm speaking of medium and large enterprises) have Veeam backups or something of the sort. But it is true that digital security is deplorable today, because efficiency and profit are prioritised over security. I guess this topic could fill another thread!
NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
This is highly interesting; unfortunately I'm not a developer, so I'm not really sure how I could implement it... Maybe someone could after I create the git repo. But I was thinking of associating keywords with sessions (this can be done automatically with a correspondence table) to add some context to them, and of adding a suggestion option that the user can choose to enable.
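As a rough idea of what the keyword correspondence table could look like, here is a small Python sketch. It is far simpler than real NLP: it just maps keywords to sessions and ranks sessions by how many query words they match. The function names and the sample keywords are invented for illustration.

```python
from collections import defaultdict


def build_keyword_table(session_keywords):
    """Invert {session: [keywords]} into {keyword: {sessions}}."""
    table = defaultdict(set)
    for session, keywords in session_keywords.items():
        for kw in keywords:
            table[kw.lower()].add(session)
    return dict(table)


def suggest_sessions(table, query, limit=5):
    """Rank sessions by the number of query words found among their keywords."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for session in table.get(word, ()):
            scores[session] += 1
    # Best score first; ties are broken by session number.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [session for session, _ in ranked[:limit]]


# Made-up keywords, only to show the calls.
table = build_keyword_table({1: ["foo", "bar"], 2: ["bar", "baz"], 3: ["qux"]})
suggestions = suggest_sessions(table, "bar baz")
```

The keyword lists themselves could be filled in by hand or generated by a script; something like NLTK could later replace the plain word split with proper tokenizing and stemming.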