Is OldWitt in danger of being taken down soon?

Started by GV, March 23, 2021, 03:46:23 PM

Miestră Schivă, UrN-GC

We had better start archiving. It turns out that the Government is going to be in financial surplus this term due to the Culture Minister having extra-Talossan calls on his time. If it will cost money to do a proper archive of OldWitt, controlled by us, let me know and we can PD something.

¡LADINTSCHIÇETZ-VOI - rogetz-mhe cacsa!
"They proved me right, they proved me wrong, but they could never last this long"

GV

Quote from: Miestră Schivă, UrN on March 23, 2021, 03:47:49 PM
We had better start archiving. It turns out that the Government is going to be in financial surplus this term due to the Culture Minister having extra-Talossan calls on his time. If it will cost money to do a proper archive of OldWitt, controlled by us, let me know and we can PD something.

My archives are voluminous, but if a native way to archive that forum to non-proprietary formats can be found, that is the dream.  I will reach out to the ProBoards devs.

GV

Quote from: Miestră Schivă, UrN on March 23, 2021, 03:47:49 PM
We had better start archiving. It turns out that the Government is going to be in financial surplus this term due to the Culture Minister having extra-Talossan calls on his time. If it will cost money to do a proper archive of OldWitt, controlled by us, let me know and we can PD something. 

What it will cost, if done properly, will be a gigantic, but short, burst of time to save every thread by brute-force methods to single-file .html, .jpg, .png, and .pdf (landscape orientation) - then upload the lot to a new and Talossa-dedicated Google Account as well as to a Talossa-dedicated DropBox and possibly to Amazon Glacier.

Whether or not ProBoards ever offers native archive functionality, my very rough estimate of the total file size is about 100 GB, though it could be more than that.

.html single-file is the most critical because it is searchable.  The other file formats guard against future tampering with the .html files and give redundancy should one or more .html files become corrupt.
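One way to make tampering or corruption detectable, whichever formats end up being kept, is a checksum manifest stored alongside the archive. What follows is only a sketch under assumptions: the function names `build_manifest` and `find_changed`, and the idea of a simple path-to-hash mapping, are illustrative and not something already in use for the archive.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_changed(root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are now missing or no longer match."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )
```

Building the manifest once, right after the files are saved, and keeping a copy with each upload would let anyone re-run `find_changed` later and see which saved pages differ from the original capture.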

Frankly, such a project is beyond my skillset to put together in such a state that it can be maintained indefinitely, and I would love to have someone take this task on!

I've put out appeals for help in this vein before, and have always been met with crickets and tumbleweeds.  There are people out there (not you, Miestră) who, had they used their influence, could have moved mountains, rallying hordes to my cause.  Yet they have heretofore refused the call.

Whatever I may have done with the Royal Archives, know this: it is to set an example for others to follow both now and in the future.  I hope not to remain Royal Archivist for more than a few more years, though I will remain in Talossa perpetually to hold my own collection of Talossana in trust for the nation and on permanent loan to the Talossan Royal Archives.

Ultimately, I want to establish a final and sustainable arrangement for the preservation of the physical archives I hold - one that does not depend on me, or on any one person, to maintain it in its entirety.

It will take a village and all political stripes to make this happen.

Baron Alexandreu Davinescu

I'll run HTTrack again.  Got it started about an hour ago and I've grabbed 2,000 threads so far, but it's still less than a gig without all the images, so probably the whole thing will be small enough to put on a flash drive that I can mail to you.  It'll take a while to do this, but last time I backed up Witt it took about... a day I think?  Should be okay as long as my IP doesn't get blocked.
Alexandreu Davinescu, Baron Davinescu del Vilatx Freiric del Vilatx Freiric es Guaír del Sabor Talossan

                   

Baron Alexandreu Davinescu

10,000 total threads and 2,300 saved so far, so we're probably talking about a day and a half, maybe.  Low-hanging fruit was faster to grab I think, and now it's parsing deeper into the boards.  The 2k-odd Chat Room threads will be lost.  I'll keep you posted, GV.  I have a buddy with some servers going so maybe I'll see if he can host the copy.  I'll also see if D.N. might be willing to let me have the Republic Witt archive to get that hosted and accessible again, too.

I will not be batch-processing them into PDF, JPG, PNG (why two image formats?!), etc.  But once I've sent them to you, you can do that if you want.

GV

Quote from: Sir Alexandreu Davinescu on March 24, 2021, 12:52:52 PM
I'll run HTTrack again.  Got it started about an hour ago and I've grabbed 2,000 threads so far, but it's still less than a gig without all the images, so probably the whole thing will be small enough to put on a flash drive that I can mail to you.  It'll take a while to do this, but last time I backed up Witt it took about... a day I think?  Should be okay as long as my IP doesn't get blocked.

Thank you!  Last time you tried this, technical forces outside your control prevented it.

Upload everything to libraryoftalossa.com - I will send you a share-invite.

GV

Quote from: Sir Alexandreu Davinescu on March 24, 2021, 02:24:16 PM
10,000 total threads and 2,300 saved so far, so we're probably talking about a day and a half, maybe.  Low-hanging fruit was faster to grab I think, and now it's parsing deeper into the boards.  The 2k-odd Chat Room threads will be lost.  I'll keep you posted, GV.  I have a buddy with some servers going so maybe I'll see if he can host the copy.  I'll also see if D.N. might be willing to let me have the Republic Witt archive to get that hosted and accessible again, too.

I will not be batch-processing them into PDF, JPG, PNG (why two image formats?!), etc.  But once I've sent them to you, you can do that if you want.

Actually, there are 13,397 (give or take a few) threads, plus the hundreds of pages of thread-links.

GV

Quote from: Sir Alexandreu Davinescu on March 24, 2021, 02:24:16 PM
10,000 total threads and 2,300 saved so far, so we're probably talking about a day and a half, maybe.  Low-hanging fruit was faster to grab I think, and now it's parsing deeper into the boards.  The 2k-odd Chat Room threads will be lost.  I'll keep you posted, GV.  I have a buddy with some servers going so maybe I'll see if he can host the copy.  I'll also see if D.N. might be willing to let me have the Republic Witt archive to get that hosted and accessible again, too.

I will not be batch-processing them into PDF, JPG, PNG (why two image formats?!), etc.  But once I've sent them to you, you can do that if you want.

No worries re multiple formats.

MPF is also the person to ask about Republic forums.  There may be some issues with those.

Even if you only get so many of these threads, redundancy with this project is brilliant.

Baron Alexandreu Davinescu

Quote from: GV on March 24, 2021, 02:43:19 PM
Quote from: Sir Alexandreu Davinescu on March 24, 2021, 02:24:16 PM
10,000 total threads and 2,300 saved so far, so we're probably talking about a day and a half, maybe.  Low-hanging fruit was faster to grab I think, and now it's parsing deeper into the boards.  The 2k-odd Chat Room threads will be lost.  I'll keep you posted, GV.  I have a buddy with some servers going so maybe I'll see if he can host the copy.  I'll also see if D.N. might be willing to let me have the Republic Witt archive to get that hosted and accessible again, too.

I will not be batch-processing them into PDF, JPG, PNG (why two image formats?!), etc.  But once I've sent them to you, you can do that if you want.

Actually, there are 13,397 (give or take a few) threads plus the hundreds of pages of thread-links
The Chat Room threads will not be included since they're behind a login.  So what gets saved will be absent those 2,000 or so threads.  About 10k pages written so far for this backup, including 2,675 threads.

Baron Alexandreu Davinescu

3,756 threads so far in 17,432 files for a total of 1.5 gigs.  About a third of the way there.  So far, so good.  Everything seems navigable as normal, too, except that you have to manually advance the page on a board archive (the php for typing in the page number doesn't work yet, although it's probably fixable).
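For anyone who wants to sanity-check the "about a third" figure, here is a rough linear projection from the numbers above. It is a sketch only: the helper name `project_totals` is made up for illustration, it assumes the remaining threads are about the same size as the ones already saved, and the 10,000 total is the rough count of reachable threads mentioned earlier in the thread rather than an exact figure.

```python
def project_totals(threads_done, files_done, gigabytes_done, threads_total):
    """Linearly extrapolate final file count and size from progress so far."""
    if threads_done <= 0:
        raise ValueError("threads_done must be positive")
    scale = threads_total / threads_done
    return round(files_done * scale), gigabytes_done * scale


# Progress figures from this post; 10,000 is an assumed rough thread total.
files, gigs = project_totals(3_756, 17_432, 1.5, 10_000)
print(f"projected: about {files:,} files and {gigs:.1f} GB")
```

On those assumptions the finished copy would land somewhere around 4 GB, which is comfortably flash-drive sized.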

Baron Alexandreu Davinescu

9,949 threads in 53,209 files, totaling 3.89 gigs.  Nearing the end now.  Took a quick glance through some multipage threads about a dozen pages deep on the main board, and they all seem to work fine.

GV

Quote from: Sir Alexandreu Davinescu on March 25, 2021, 01:53:40 PM
9,949 threads in 53,209 files, totaling 3.89 gigs.  Nearing the end now.  Took a quick glance through some multipage threads about a dozen pages deep on the main board, and they all seem to work fine.

There is a 'listening to' thread of about 40-50 pages lol.

Baron Alexandreu Davinescu

Song club, TMT, etc. will all be excluded since they're in the Chat Room.  I'll probably fiddle with the script a little and try to get those threads after this run is done, though.  My only concern is that I'll then have to figure out how to exclude many years' worth of private messages, too.
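The exclusion itself could be as simple as a path filter applied before anything is fetched or kept. HTTrack has its own include/exclude scan rules for this, so the following is just the idea in plain Python, not the script actually being run. The path prefixes in `PRIVATE_PATTERNS` are hypothetical placeholders; the real ones would need to be checked against the URLs the forum uses for private messages once logged in.

```python
import re
from urllib.parse import urlsplit

# Hypothetical path prefixes for private messages; adjust to the real forum.
PRIVATE_PATTERNS = [
    re.compile(r"^/conversations?(/|$)"),
    re.compile(r"^/messages?(/|$)"),
]


def is_private(url: str) -> bool:
    """True if the URL's path falls under any private-message prefix."""
    path = urlsplit(url).path
    return any(pattern.search(path) for pattern in PRIVATE_PATTERNS)


def keep_for_archive(urls):
    """Return only the URLs that are safe to include in the public copy."""
    return [url for url in urls if not is_private(url)]
```

Filtering on the path rather than the whole URL means a thread whose title merely contains the word "message" would still be kept.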

Danihel Txechescu

Quote from: Sir Alexandreu Davinescu on March 25, 2021, 04:23:40 PM
Song club, TMT, etc. will all be excluded since they're in the Chat Room.  I'll probably fiddle with the script a little and try to get those threads after this run is done, though.  My only concern is that I'll then have to figure out how to exclude many years' worth of private messages, too.
I could take it from there if need be, I guess, as my participation on Wittspace has been very much limited.

(Also, great that you found a tool to do all the web scraping! Kudos on this.)