Session transcripts as one big HTML file

Joe · Jun 20, 2020

goyacobol said:
Sorry for my rant. I reacted to the general questioning and replied to your question, which was perfectly valid. I do not think you personally expected "a free lunch". I see that the thread is networking fine now. We can possibly learn how to create a useful tool/utility that can be used off-line in times of limited internet access.

Have you ever thought about changing your avatar, goyacobol? I've often thought that, even if it may be how you feel sometimes, having something a bit more inspiring might be a better idea, like something that represents where you want to be rather than where you feel you are (sometimes). Just a suggestion.

Joe · Jun 20, 2020

Really nice work, KS! Btw, there are over 1 million words in those sessions, which comes to 3,258 pages in a Word document set in Times New Roman, size 12.

3DStudent · Jun 20, 2020

Thanks for the HTML file, KS. It looks handy, along with the Notepad++ searching.

KJS · Jun 20, 2020

chrismcdude said:
I was going to post the same thing! I have scraped the Cassiopaean Experiment transcripts that are available online (using a basic scraper that I wrote in Python) and stored them as individual text files on my computer. I use a feature in Notepad++ (a free application) called "Find in Files" to search for text inside the transcripts. Results are sorted by transcript date (which is the name of each session file) and include the sentences in which the search text occurs. When I click on any result, it opens the session that contains the search text.
Attaching an image below for reference-

View attachment 37087

@KS 's single page HTML with all the transcripts is also really cool. Love the clean formatting. Thanks a lot for sharing!!

You definitely need to share your work. How did you manage to transform the HTML content into plain text? Readability, by hand, or maybe both?

goyacobol said:
It is not a lack of gratitude. It was an overreaction. My apologies to @Luks for reacting to one particular question. I chose a poor way to express a general observation. There is so much technology that we use without thinking about the design effort that is required to make it efficient. I felt bad for @KS sharing a "free" utility and being flooded with questions, but that is a part of refining an application.

I suppose everyone realizes the code may need some more testing and revisions but I think it may be useful as it is for some.

No need to feel bad for me, but thanks for the support! :) I was expecting feedback, because being a new member and posting some big HTML file out of nowhere might be viewed as potentially malicious activity. Also, nobody asked for such a thing in the first place.

So far, I've identified the following issues:

31 Oct 2001 - missing session
3 Sep 2008 - session scraped incorrectly (quoted)
22 Oct 2008 - session scraped incorrectly (quoted)
3 Jan 2009 - session scraped incorrectly (quoted)
9 Jun 2009 - session scraped incorrectly (quoted)
22 Feb 2010 - session scraped incorrectly (2nd post)

Trobar · Jun 20, 2020

Thank you very much! This is excellent! :thup:

tschai · Jun 20, 2020

goyacobol said:
This technology is not as simple as many think and @KS has done a great service for those who just expect to have everything on a silver platter. I would just say thanks to @KS for giving his/her best. If "time" permits @KS may find a way to tweak the application. But consider that this is a process of refinement. Believe me, I know the hours and effort it takes to deliver a quality product. Those who expect a "free lunch" will always be disappointed. I think @Scottie would understand. If we work together we can do great things, but show a little appreciation when possible.

I agree, this is amazing work. Thank you, KS!

Arwenn · Jun 20, 2020

KS said:
Can you elaborate more or maybe post some screenshot while encountering that issue?

I am using an iPad, here are some screenshots of trying to open the updated file on Chrome & Mega (app).

Here is a screenshot of the older file opened in Mega.

Maybe the updated file is too big for Mega, but I dunno why it won’t open in Chrome for me.

And once again, thank you very much for doing this @KS!

SMM · Jun 20, 2020

Thank you KS :thup:

SMM · Jun 20, 2020

Arwenn said:
I am using an iPad, here are some screenshots of trying to open the updated file on Chrome & Mega (app).

View attachment 37092 View attachment 37093
Here is a screenshot of the older file opened in Mega.
View attachment 37094
Maybe the updated file is too big for Mega, but I dunno why it won’t open in Chrome for me.

Is it saved to the iPad?

Arwenn · Jun 20, 2020

SMM said:
Is it saved to the iPad?

I don’t think so. It downloads it as a zip file in a new window and asks which app I want to open it with.

Glenn · Jun 21, 2020

Wow, this is amazing!!! Excellent work... :thup:

seek10 · Jun 21, 2020

Arwenn said:
I don’t think so. It downloads it as a zip file in a new window and asks which app I want to open it with.
View attachment 37096

I was able to open it in a text editor and in an EPUB app, but not in Chrome on the iPad.

goyacobol · Jun 21, 2020

goyacobol said:
@KS .

I was just checking it out and there may be some things to iron out.
Comparing the forum version to the html version I noticed some odd character insertions.

Session 11 August 1996:

A: No, it would be a “discover”.

In the html version it is:

A: No, it would be a â€œdiscoverâ€.

Doing some test searching with my own PDF versions there seems to be some sessions/items missed using the html version.

It may still be useful for some.

@KS,

Whatever changes you made have fixed the above inserted character problem. Thanks. I really like what you have created.
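
For anyone curious where characters like "â€œ" come from: they are the typical result of UTF-8 text being decoded as Windows-1252 somewhere along the way. The sketch below reproduces the garbled line from the session quoted above. That this was the actual cause in the HTML file is only an assumption, and the function names are made up for illustration.

```python
# Minimal sketch, assuming the garbling came from reading UTF-8 bytes
# as Windows-1252 (cp1252). This is a guess at the cause, not a
# description of how the HTML file was actually produced.

ORIGINAL = "A: No, it would be a \u201cdiscover\u201d."  # curly quotes


def decode_wrong(raw: bytes) -> str:
    """Decode UTF-8 bytes as cp1252, dropping bytes cp1252 cannot map."""
    return raw.decode("cp1252", errors="ignore")


def decode_right(raw: bytes) -> str:
    """Decode the same bytes with the encoding they were written in."""
    return raw.decode("utf-8")


def looks_garbled(text: str) -> bool:
    """Heuristic check: U+00E2 followed by U+20AC is the usual residue
    of a mis-decoded curly quote or dash."""
    return "\u00e2\u20ac" in text


raw = ORIGINAL.encode("utf-8")
garbled = decode_wrong(raw)
clean = decode_right(raw)

print(garbled)
print(clean)
print(looks_garbled(garbled), looks_garbled(clean))
```

The closing quote loses one byte in the wrong decoding, so the damage cannot always be reversed afterwards. Decoding with the right encoding in the first place is the safer fix.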

sid · Jun 21, 2020

Arwenn said:
I don’t think so. It downloads it as a zip file in a new window and asks which app I want to open it with.
View attachment 37096

On the iPad, when you have the file in focus, there is an option to “decompress” down below in the options-arrow which will extract the html file. You then open that html file.

Chris · Jun 21, 2020

KS said:
You definitely need to share your work. How did you manage to transform the HTML content into plain text? Readability, by hand, or maybe both?

I used Beautiful Soup, a Python library which did most of the work of scraping and converting the HTML/XML and storing the transcripts into individual .txt files. But there were a few transcripts for which I had to change the regular expressions used to parse the session date correctly (as there are multiple formats for the dates given in the transcript) or for which I had to manually copy-paste the transcript into a text file.
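
To give a rough idea of what the HTML-to-text step involves, here is a minimal sketch. Note that it uses Python's built-in html.parser module as a stand-in so that it runs without installing anything; it is not the Beautiful Soup code from the attached scraper. The class name, the list of block tags and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, putting line breaks around block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "li", "blockquote"}  # assumed tag list
    SKIP_TAGS = {"script", "style"}  # content that is never visible text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self) -> str:
        # Strip each line and drop the empty ones.
        lines = (line.strip() for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample markup, not taken from the real transcripts.
sample = "<div><p>First line.</p><p>Second &amp; last line.</p><script>x()</script></div>"
print(html_to_text(sample))
```

A dedicated library such as Beautiful Soup copes much better with messy real-world markup, which is presumably why it did most of the work here.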

I am attaching the source code (written in Python with the Beautiful Soup library) as well as all 365 session transcripts in .txt format that I have scraped so far. The naming convention I used for each session file is yyyymmdd, which makes it easier to look up text within the files using Notepad++ and keeps the files in chronological order on my hard disk.
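
As an illustration of the yyyymmdd naming and the multiple date formats mentioned above, here is a minimal sketch. It tries a short list of date formats with datetime.strptime instead of the regular expressions used in the attached scraper, and the list of formats and the function name are assumptions, not taken from that code.

```python
from datetime import datetime

# Assumed formats; the real transcripts may use others.
DATE_FORMATS = (
    "%d %B %Y",   # 11 August 1996
    "%d %b %Y",   # 3 Sep 2008
    "%B %d, %Y",  # August 11, 1996
    "%b %d, %Y",  # Sep 3, 2008
)


def session_filename(date_text: str) -> str:
    """Turn a session date in any known format into 'yyyymmdd.txt'."""
    cleaned = " ".join(date_text.split())  # collapse stray whitespace
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next format
        return parsed.strftime("%Y%m%d") + ".txt"
    raise ValueError(f"Unrecognised session date: {date_text!r}")


print(session_filename("11 August 1996"))
print(session_filename("3 Sep 2008"))
```

Because the names start with the year, a plain alphabetical sort of the files is also a chronological sort.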

If anyone wants to search the transcripts, download the attached zip file called "Cassiopaean Experiment transcripts" and extract it to any folder on your computer. The folder contains all the transcripts as separate text files. Download and install Notepad++, open it, click "Search" at the top and select "Find in Files". Paste the location of the extracted "Cassiopaean Experiment transcripts" folder into the "Directory :" field and type the text you want to search for in the "Find what :" field. Keep the other search options the same as in the attached picture. A screenshot of a sample search for "knowledge" and its results is below:

Using Notepad++ for search within transcripts.JPG

It gives the following search results -

If you click on any of the search results, it will open the transcript and go to the location of that text in the file. Hope this helps!
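
For those who would rather not install Notepad++, the same kind of search can be approximated with a few lines of Python. This is only a rough sketch of a "Find in Files" style search over a folder of .txt transcripts, not how Notepad++ works internally. The function name and the folder path in the usage comment are hypothetical.

```python
from pathlib import Path


def find_in_files(folder, needle):
    """Return (file name, line number, line) for every line containing
    `needle`, ignoring case. Files are visited in name order, which is
    chronological when they are named yyyymmdd.txt."""
    hits = []
    needle_lower = needle.lower()
    for path in sorted(Path(folder).glob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if needle_lower in line.lower():
                    hits.append((path.name, line_no, line.strip()))
    return hits


# Example usage (the folder path is hypothetical):
# for name, line_no, line in find_in_files("Cassiopaean Experiment transcripts", "knowledge"):
#     print(f"{name}:{line_no}: {line}")
```

Unlike Notepad++, this does not open the file at the matching line, but the file name and line number are enough to find it by hand.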

For those interested in the scraper, its source code is also attached as a separate zip file. Any faults in the scraper are my own, and I encourage feedback on it, since I'm not a full-time programmer and only have the basic programming knowledge that I used to build it. My only intention in building a scraper was to be able to search the transcripts more quickly and easily than with the online tool on the forum, which is great as well. My sincere thanks to Laura and the crew for making the transcripts freely available to everyone.
