Is OldWitt in danger of being taken down soon?

Started by GV, March 23, 2021, 03:46:23 PM

Miestră Schivă, UrN-GC

We had better start archiving. It turns out that the Government is going to be in financial surplus this term due to the Culture Minister having extra-Talossan calls on his time. If it will cost money to do a proper archive of OldWitt, controlled by us, let me know and we can PD something.

¡LADINTSCHIÇETZ-VOI - rogetz-mhe cacsa!
"They proved me right, they proved me wrong, but they could never last this long"

GV

Quote from: Miestră Schivă, UrN on March 23, 2021, 03:47:49 PM
We had better start archiving. It turns out that the Government is going to be in financial surplus this term due to the Culture Minister having extra-Talossan calls on his time. If it will cost money to do a proper archive of OldWitt, controlled by us, let me know and we can PD something.

My archives are voluminous, but if a native way to archive that forum to non-proprietary formats can be found, that is the dream.  I will reach out to the ProBoards devs.

GV

Quote from: Miestră Schivă, UrN on March 23, 2021, 03:47:49 PM
We had better start archiving. It turns out that the Government is going to be in financial surplus this term due to the Culture Minister having extra-Talossan calls on his time. If it will cost money to do a proper archive of OldWitt, controlled by us, let me know and we can PD something. 

What it will cost, if done properly, is a gigantic but short burst of time to save every thread by brute-force methods to single-file .html, .jpg, .png, and .pdf (landscape orientation) - then upload the lot to a new, Talossa-dedicated Google Account as well as to a Talossa-dedicated DropBox and possibly to Amazon Glacier.

Whether or not ProBoards ever offers native archive functionality, a very rough estimate of the total filesize is around 100 GB, though it could be even more than that.

.html single-file is the most critical because it is searchable.  The other file formats guard against future tampering with the .html files and give redundancy should one or more .html files become corrupt.
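
One complementary way to detect later tampering or corruption is a checksum manifest. This is a different technique from the multiple-format redundancy described above, not something proposed in the thread. The sketch below assumes the archive lives in one local folder; the function names and the folder layout are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_root: Path) -> Dict[str, str]:
    """Map each file's path (relative to archive_root) to its digest."""
    return {
        p.relative_to(archive_root).as_posix(): sha256_of(p)
        for p in sorted(archive_root.rglob("*"))
        if p.is_file()
    }


def changed_files(archive_root: Path, manifest: Dict[str, str]) -> List[str]:
    """List manifest entries that are now missing or no longer match."""
    current = build_manifest(archive_root)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )
```

Storing the manifest separately from the archive (for example, alongside each of the uploaded copies) would let anyone re-run `changed_files` later and see which files differ from the day they were saved.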

Frankly, such a project is beyond my skillset to put together in such a state that it can be maintained indefinitely, and I would love to have someone take this task on!

I've put out appeals for help in this vein before, and have always been met with crickets and tumbleweeds.  There are people out there (not you, Miestrâ) who, had they used their influence, could have moved mountains, rallying hordes to my cause.  Yet they have heretofore refused the call.

Whatever I may have done with the Royal Archives, know this: it is to set an example for others to follow both now and in the future.  I do not intend to remain Royal Archivist for more than a few more years, hopefully, while I remain in Talossa perpetually to hold my own collection of Talossana in trust for the nation and on permanent loan to the Talossan Royal Archives.

Ultimately, I want to have set up a final and sustainable setup for the preservation of the physical archives I hold - a setup independent of my need to maintain its entirety and one that is not dependent on just one person for its entire preservation. 

It will take a village and all political stripes to make this happen.

Baron Alexandreu Davinescu

I'll run HTTrack again.  Got it started about an hour ago and I've grabbed 2,000 threads so far, but it's still less than a gig without all the images, so probably the whole thing will be small enough to put on a flash drive that I can mail to you.  It'll take a while to do this, but last time I backed up Witt it took about... a day I think?  Should be okay as long as my IP doesn't get blocked.
Alexandreu Davinescu, Baron Davinescu del Vilatx Freiric del Vilatx Freiric es Guaír del Sabor Talossan

Baron Alexandreu Davinescu

10,000 total threads and 2,300 saved so far, so we're probably talking about a day and a half, maybe.  Low-hanging fruit was faster to grab I think, and now it's parsing deeper into the boards.  The 2k-odd Chat Room threads will be lost.  I'll keep you posted, GV.  I have a buddy with some servers going so maybe I'll see if he can host the copy.  I'll also see if D.N. might be willing to let me have the Republic Witt archive to get that hosted and accessible again, too.

I will not be batch-processing them into PDF, JPG, PNG (why two image formats?!), etc.  But once I've sent them to you, you can do that if you want.

GV

Quote from: Sir Alexandreu Davinescu on March 24, 2021, 12:52:52 PM
I'll run HTTrack again.  Got it started about an hour ago and I've grabbed 2,000 threads so far, but it's still less than a gig without all the images, so probably the whole thing will be small enough to put on a flash drive that I can mail to you.  It'll take a while to do this, but last time I backed up Witt it took about... a day I think?  Should be okay as long as my IP doesn't get blocked.

Thank you!  Last time you tried this, technical forces outside your control prevented it.

Upload everything to libraryoftalossa.com - I will send you a share-invite.

GV

Quote from: Sir Alexandreu Davinescu on March 24, 2021, 02:24:16 PM
10,000 total threads and 2,300 saved so far, so we're probably talking about a day and a half, maybe.  Low-hanging fruit was faster to grab I think, and now it's parsing deeper into the boards.  The 2k-odd Chat Room threads will be lost.  I'll keep you posted, GV.  I have a buddy with some servers going so maybe I'll see if he can host the copy.  I'll also see if D.N. might be willing to let me have the Republic Witt archive to get that hosted and accessible again, too.

I will not be batch-processing them into PDF, JPG, PNG (why two image formats?!), etc.  But once I've sent them to you, you can do that if you want.

Actually, there are 13,397 (give or take a few) threads plus the hundreds of pages of thread-links.

GV

Quote from: Sir Alexandreu Davinescu on March 24, 2021, 02:24:16 PM
10,000 total threads and 2,300 saved so far, so we're probably talking about a day and a half, maybe.  Low-hanging fruit was faster to grab I think, and now it's parsing deeper into the boards.  The 2k-odd Chat Room threads will be lost.  I'll keep you posted, GV.  I have a buddy with some servers going so maybe I'll see if he can host the copy.  I'll also see if D.N. might be willing to let me have the Republic Witt archive to get that hosted and accessible again, too.

I will not be batch-processing them into PDF, JPG, PNG (why two image formats?!), etc.  But once I've sent them to you, you can do that if you want.

No worries re multiple formats.

MPF is also the person to ask about Republic forums.  There may be some issues with those.

Even if you only get so many of these threads, redundancy with this project is brilliant.

Baron Alexandreu Davinescu

Quote from: GV on March 24, 2021, 02:43:19 PM
Quote from: Sir Alexandreu Davinescu on March 24, 2021, 02:24:16 PM
10,000 total threads and 2,300 saved so far, so we're probably talking about a day and a half, maybe.  Low-hanging fruit was faster to grab I think, and now it's parsing deeper into the boards.  The 2k-odd Chat Room threads will be lost.  I'll keep you posted, GV.  I have a buddy with some servers going so maybe I'll see if he can host the copy.  I'll also see if D.N. might be willing to let me have the Republic Witt archive to get that hosted and accessible again, too.

I will not be batch-processing them into PDF, JPG, PNG (why two image formats?!), etc.  But once I've sent them to you, you can do that if you want.

Actually, there are 13,397 (give or take a few) threads plus the hundreds of pages of thread-links.

The Chat Room threads will not be included since they're behind a login.  So what gets saved will be absent those 2,000 or so threads.  About 10k pages written so far for this backup, including 2,675 threads.

Baron Alexandreu Davinescu

3,756 threads so far in 17,432 files for a total of 1.5 gigs.  About a third of the way there.  So far, so good.  Everything seems navigable as normal, too, except that you have to manually advance the page on a board archive (the php for typing in the page number doesn't work yet, although it's probably fixable).

Baron Alexandreu Davinescu

9,949 threads in 53,209 files, totaling 3.89 gigs.  Nearing the end now.  Took a quick glance through some multipage threads about a dozen pages deep on the main board, and they all seem to work fine.

GV

Quote from: Sir Alexandreu Davinescu on March 25, 2021, 01:53:40 PM
9,949 threads in 53,209 files, totaling 3.89 gigs.  Nearing the end now.  Took a quick glance through some multipage threads about a dozen pages deep on the main board, and they all seem to work fine.

There is a 'listening to' thread of about 40-50 pages lol.

Baron Alexandreu Davinescu

Song club, TMT, etc. will all be excluded since they're in the Chat Room.  I'll probably fiddle with the script a little and try to get those threads after this run is done, though.  My only concern is that I'll then have to figure out how to exclude many years' worth of private messages, too.

Danihel Txechescu

Quote from: Sir Alexandreu Davinescu on March 25, 2021, 04:23:40 PM
Song club, TMT, etc. will all be excluded since they're in the Chat Room.  I'll probably fiddle with the script a little and try to get those threads after this run is done, though.  My only concern is that I'll then have to figure out how to exclude many years' worth of private messages, too.

I could take it from there if need be, I guess, as my participation on Wittspace has been very much limited.

(Also, great that you found a tool to do all the web scraping! Kudos on this.)

Baron Alexandreu Davinescu

They all seem to be located in one directory actually, so it should be simple to just delete it.  That's assuming I can scrape while logged-in at all, of course.  Kind of amazed I grabbed 5 gigs of data without getting blocked already, actually.

GV, I can upload this to Drive when it's done if you want, but don't be disappointed when you can't view it.  It'll never be conveniently viewable from Drive, since it's a full HTML backup and Drive doesn't support interlinking between files like that.  So you will be able to go through and view individual threads by manually selecting them by numbered directory, but not just through normal navigation.  In fact, until I go through and batch edit it to change the links so that they don't specifically reference the full location on my own computer, it won't work for you as normal even if you download it (since your computer would follow a link to D:/PC/FolderName/BlahBlah/thread/736737.html, but wouldn't find anything there).  I think I can write that code without too much trouble, though, and I'll be sure to make a backup on the Drive before I start fiddling with it.  I think your days of worrying about losing this data are over.  You'll be left with nothing to do (although again it would be good to be able to host both Republic's Witt and this one on the same server for posterity).
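
A minimal sketch of the batch edit described above, assuming every saved page embeds the same absolute prefix (the `D:/PC/FolderName/BlahBlah/` path is only the illustrative one from this post, and `relativize_links` is a hypothetical name). It rewrites link text only; file and directory names are left untouched, so thread numbers are preserved.

```python
import os
from pathlib import Path


def relativize_links(mirror_root: Path, absolute_prefix: str) -> int:
    """Replace absolute_prefix in every .html file under mirror_root with a
    relative path back to mirror_root. Returns the number of files changed.

    Works on raw bytes so page encodings and line endings are not altered.
    """
    old = absolute_prefix.encode("utf-8")
    changed = 0
    for page in mirror_root.rglob("*.html"):
        # Path from this page's folder up to the mirror root, e.g. ".." or "."
        to_root = Path(os.path.relpath(mirror_root, page.parent)).as_posix()
        new = (to_root + "/").encode("utf-8")
        data = page.read_bytes()
        updated = data.replace(old, new)
        if updated != data:
            page.write_bytes(updated)
            changed += 1
    return changed
```

For a page saved at `thread/1.html`, a link to `D:/PC/FolderName/BlahBlah/thread/736737.html` would become `../thread/736737.html`, which resolves the same way on any computer. Running it on a copy first, as noted above, keeps the untouched download available if the real prefix turns out to differ.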

Baron Alexandreu Davinescu

Side note: this is one of the reasons Talossa is so fun.  You can just dive into a task that you'd never have any reason to do otherwise, and learn some new skills on the way!  I've gotten so much out of Talossa this way.

Baron Alexandreu Davinescu

Okay, all downloaded!  Zipping now and then I'll upload it.  11,710 threads in total.  It'll take probably about a half hour or so to zip, since it's 6 gigs and 64,128 files.  And then it'll take a while to upload.  As you might guess, my computer and my internet connection aren't top of the line.  But by morning, GV, you will be the proud owner of a genuine and reasonably complete backup of Wittenberg.

GV

Quote from: Sir Alexandreu Davinescu on March 25, 2021, 07:22:22 PM
Side note: this is one of the reasons Talossa is so fun.  You can just dive into a task that you'd never have any reason to do otherwise, and learn some new skills on the way!  I've gotten so much out of Talossa this way.

You and me both, Alexander.  Back in the day, I was doing horrible websites and through Talossa, I've kept up my writing skills and gotten a view onto the world I never, ever would have gotten otherwise. 

GV

Quote from: Sir Alexandreu Davinescu on March 25, 2021, 07:37:39 PM
Okay, all downloaded!  Zipping now and then I'll upload it.  11,710 threads in total.  It'll take probably about a half hour or so to zip, since it's 6 gigs and 64,128 files.  And then it'll take a while to upload.  As you might guess, my computer and my internet connection aren't top of the line.  But by morning, GV, you will be the proud owner of a genuine and reasonably complete backup of Wittenberg.

Thank you!!  What I will do is go through and make sure all threads are represented.  This will not take too long.
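
A rough sketch of how that check could be automated, assuming each saved thread is an .html file inside a folder named `thread` whose file name starts with the thread number (as in the `thread/736737.html` example earlier in this topic). The layout, the function names, and the list of expected thread numbers are all assumptions; the expected list would have to come from an independent record.

```python
import re
from pathlib import Path
from typing import Iterable, List, Set

# Leading digits of a file name such as "736737.html" or "736737-2.html"
THREAD_NUMBER = re.compile(r"^(\d+)")


def saved_thread_numbers(mirror_root: Path) -> Set[int]:
    """Collect the thread numbers found under any 'thread' folder."""
    numbers = set()
    for page in mirror_root.rglob("*.html"):
        if page.parent.name != "thread":
            continue
        match = THREAD_NUMBER.match(page.stem)
        if match:
            numbers.add(int(match.group(1)))
    return numbers


def missing_threads(expected: Iterable[int], mirror_root: Path) -> List[int]:
    """Return the expected thread numbers that have no saved file."""
    return sorted(set(expected) - saved_thread_numbers(mirror_root))
```

Because the check reads thread numbers straight from the file names, it only works as long as those names are left unchanged.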

As for internal file-linkage, I've always had the expectation people would have to do the searching for whatever on OldWitt by hand. 

If the batch-editing messes with the original thread numbers, don't do the batch editing.  The thread numbers are critical to future citation and research.