Check out more details at https://wittenberg.talossa.com/index.php?topic=730.new#new
GV, Royal Archivist
We had better start archiving. It turns out that the Government is going to be in financial surplus this term due to the Culture Minister having extra-Talossan calls on his time. If it will cost money to do a proper archive of OldWitt, controlled by us, let me know and we can PD something.
Quote from: Miestră Schivă, UrN on March 23, 2021, 03:47:49 PM
We had better start archiving. It turns out that the Government is going to be in financial surplus this term due to the Culture Minister having extra-Talossan calls on his time. If it will cost money to do a proper archive of OldWitt, controlled by us, let me know and we can PD something.
My archives are voluminous, but if a native way to archive that forum to non-proprietary formats can be found, that is the dream. I will reach out to the ProBoards devs.
Quote from: Miestră Schivă, UrN on March 23, 2021, 03:47:49 PM
We had better start archiving. It turns out that the Government is going to be in financial surplus this term due to the Culture Minister having extra-Talossan calls on his time. If it will cost money to do a proper archive of OldWitt, controlled by us, let me know and we can PD something.
What it will cost, if done properly, is a gigantic but short burst of time to save every thread by brute-force methods to single-file .html, .jpg, .png, and .pdf (landscape orientation), then upload the lot to a new, Talossa-dedicated Google Account as well as to a Talossa-dedicated DropBox and possibly to Amazon Glacier.
Whether or not ProBoards ever adds native archive functionality, my very rough estimate of the total file size is about 100 GB, though it could turn out to be more than that.
Single-file .html is the most critical format because it is searchable. The other file formats guard against future tampering with the .html files and give redundancy should one or more .html files become corrupt.
Frankly, such a project is beyond my skillset to put together in such a state that it can be maintained indefinitely, and I would love to have someone take this task on!
I've put out appeals for help in this vein before, and have always been met with crickets and tumbleweeds. There are people out there (not you, Miestră) who, had they used their influence, could have moved mountains and rallied hordes to my cause. Yet they have heretofore refused the call.
Whatever I may have done with the Royal Archives, know this: it is to set an example for others to follow both now and in the future. I do not intend to remain Royal Archivist for more than a few more years, hopefully, while I remain in Talossa perpetually to hold my own collection of Talossana in trust for the nation and on permanent loan to the Talossan Royal Archives.
Ultimately, I want to have set up a final and sustainable setup for the preservation of the physical archives I hold - a setup independent of my need to maintain its entirety and one that is not dependent on just one person for its entire preservation.
It will take a village and all political stripes to make this happen.
I'll run HTTrack again. Got it started about an hour ago and I've grabbed 2,000 threads so far, but it's still less than a gig without all the images, so probably the whole thing will be small enough to put on a flash drive that I can mail to you. It'll take a while to do this, but last time I backed up Witt it took about... a day I think? Should be okay as long as my IP doesn't get blocked.
10,000 total threads and 2,300 saved so far, so we're probably talking about a day and a half, maybe. Low-hanging fruit was faster to grab I think, and now it's parsing deeper into the boards. The 2k-odd Chat Room threads will be lost. I'll keep you posted, GV. I have a buddy with some servers going so maybe I'll see if he can host the copy. I'll also see if D.N. might be willing to let me have the Republic Witt archive to get that hosted and accessible again, too.
I will not be batch-processing them into PDF, JPG, PNG (why two image formats?!), etc. But once I've sent them to you, you can do that if you want.
Quote from: Sir Alexandreu Davinescu on March 24, 2021, 12:52:52 PM
I'll run HTTrack again. Got it started about an hour ago and I've grabbed 2,000 threads so far, but it's still less than a gig without all the images, so probably the whole thing will be small enough to put on a flash drive that I can mail to you. It'll take a while to do this, but last time I backed up Witt it took about... a day I think? Should be okay as long as my IP doesn't get blocked.
Thank you! Last time you tried this, technical forces outside your control prevented it.
Upload everything to libraryoftalossa.com - I will send you a share-invite.
Quote from: Sir Alexandreu Davinescu on March 24, 2021, 02:24:16 PM
10,000 total threads and 2,300 saved so far, so we're probably talking about a day and a half, maybe. Low-hanging fruit was faster to grab I think, and now it's parsing deeper into the boards. The 2k-odd Chat Room threads will be lost. I'll keep you posted, GV. I have a buddy with some servers going so maybe I'll see if he can host the copy. I'll also see if D.N. might be willing to let me have the Republic Witt archive to get that hosted and accessible again, too.
I will not be batch-processing them into PDF, JPG, PNG (why two image formats?!), etc. But once I've sent them to you, you can do that if you want.
Actually, there are 13,397 (give or take a few) threads, plus the hundreds of pages of thread-links.
Quote from: Sir Alexandreu Davinescu on March 24, 2021, 02:24:16 PM
10,000 total threads and 2,300 saved so far, so we're probably talking about a day and a half, maybe. Low-hanging fruit was faster to grab I think, and now it's parsing deeper into the boards. The 2k-odd Chat Room threads will be lost. I'll keep you posted, GV. I have a buddy with some servers going so maybe I'll see if he can host the copy. I'll also see if D.N. might be willing to let me have the Republic Witt archive to get that hosted and accessible again, too.
I will not be batch-processing them into PDF, JPG, PNG (why two image formats?!), etc. But once I've sent them to you, you can do that if you want.
No worries re multiple formats.
MPF is also the person to ask about Republic forums. There may be some issues with those.
Even if you only get so many of these threads, redundancy with this project is brilliant.
Quote from: GV on March 24, 2021, 02:43:19 PM
Actually, there are 13,397 (give or take a few) threads, plus the hundreds of pages of thread-links.
The Chat Room threads will not be included since they're behind a login. So what gets saved will be absent those 2,000 or so threads. About 10k pages written so far for this backup, including 2,675 threads.
3,756 threads so far in 17,432 files for a total of 1.5 gigs. About a third of the way there. So far, so good. Everything seems navigable as normal, too, except that you have to manually advance the page on a board archive (the php for typing in the page number doesn't work yet, although it's probably fixable).
9,949 threads in 53,209 files, totaling 3.89 gigs. Nearing the end now. Took a quick glance through some multipage threads about a dozen pages deep on the main board, and they all seem to work fine.
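Running totals like the ones above can be reproduced with a few lines of Python. This is only a minimal sketch, assuming the whole mirror sits in one local folder; the function name tally_mirror and the folder name in the example are made up for illustration and are not part of HTTrack.

```python
import tempfile
from pathlib import Path


def tally_mirror(mirror_root):
    """Return (file_count, total_bytes) for every file under mirror_root."""
    file_count = 0
    total_bytes = 0
    for path in Path(mirror_root).rglob("*"):
        if path.is_file():
            file_count += 1
            total_bytes += path.stat().st_size
    return file_count, total_bytes


if __name__ == "__main__":
    # Tiny self-contained demo on a throwaway folder (not the real mirror).
    with tempfile.TemporaryDirectory() as demo_root:
        (Path(demo_root) / "index.html").write_text("<html></html>")
        files, size = tally_mirror(demo_root)
        print(f"{files} files, {size / 1e9:.6f} gigs")
```

Pointing tally_mirror at the real mirror folder (for example a hypothetical "Witt2") would give the file count and size figures directly.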
Quote from: Sir Alexandreu Davinescu on March 25, 2021, 01:53:40 PM
9,949 threads in 53,209 files, totaling 3.89 gigs. Nearing the end now. Took a quick glance through some multipage threads about a dozen pages deep on the main board, and they all seem to work fine.
There is a 'listening to' thread of about 40-50 pages lol.
Song club, TMT, etc. will all be excluded since they're in the Chat Room. I'll probably fiddle with the script a little and try to get those threads after this run is done, though. My only concern is that I'll then have to figure out how to exclude many years' worth of private messages, too.
Quote from: Sir Alexandreu Davinescu on March 25, 2021, 04:23:40 PM
Song club, TMT, etc. will all be excluded since they're in the Chat Room. I'll probably fiddle with the script a little and try to get those threads after this run is done, though. My only concern is that I'll then have to figure out how to exclude many years' worth of private messages, too.
I could take it from there if need be, I guess, as my participation on Wittspace has been very much limited.
(Also, great that you found a tool to do all the web scraping! Kudos on this.)
They all seem to be located in one directory actually, so it should be simple to just delete it. That's assuming I can scrape while logged-in at all, of course. Kind of amazed I grabbed 5 gigs of data without getting blocked already, actually.
GV, I can upload this to Drive when it's done if you want, but don't be disappointed when you can't view it. It'll never be conveniently viewable from Drive, since it's a full HTML backup and Drive doesn't support interlinking between files like that. So you will be able to go through and view individual threads by manually selecting them by numbered directory, but not just through normal navigation. In fact, until I go through and batch edit it to change the links so that they don't specifically reference the full location on my own computer, it won't work for you as normal even if you download it (since your computer would follow a link to D:/PC/FolderName/BlahBlah/thread/736737.html, but wouldn't find anything there). I think I can write that code without too much trouble, though, and I'll be sure to make a backup on the Drive before I start fiddling with it. I think your days of worrying about losing this data are over. You'll be left with nothing to do (although again it would be good to be able to host both Republic's Witt and this one on the same server for posterity).
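The batch edit described above, turning links that point at one machine's full path into links relative to each page, could look roughly like the Python sketch below. It is not the code that was actually used. It assumes the links appear as href or src attribute values beginning with a single known prefix (the placeholder D:/PC/FolderName/BlahBlah from the post), and the function name relativize_links is invented for illustration.

```python
import posixpath
import re

# Matches href="..." or src='...' and captures the quote style and the URL.
LINK_ATTR = re.compile(
    r"""(?P<attr>\b(?:href|src)\s*=\s*)(?P<quote>["'])(?P<url>.*?)(?P=quote)""",
    re.IGNORECASE | re.DOTALL,
)


def relativize_links(html, page_rel_path, absolute_prefix):
    """Rewrite links starting with absolute_prefix so they are relative to the page.

    page_rel_path is the page's own location inside the mirror, with forward
    slashes, e.g. "thread/736737.html". Links that do not start with
    absolute_prefix (external sites, already-relative links) are left untouched.
    """
    page_dir = posixpath.dirname(page_rel_path) or "."
    prefix = absolute_prefix.rstrip("/") + "/"

    def fix(match):
        url = match.group("url")
        if not url.startswith(prefix):
            return match.group(0)
        remainder = url[len(prefix):]
        # Keep any ?query or #fragment out of the path calculation.
        path, suffix = re.match(r"([^?#]*)(.*)", remainder, re.DOTALL).groups()
        relative = posixpath.relpath(path or ".", start=page_dir)
        return f"{match.group('attr')}{match.group('quote')}{relative}{suffix}{match.group('quote')}"

    return LINK_ATTR.sub(fix, html)
```

Only the attribute values change in this sketch; file and directory names, including the thread numbers, are left alone.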
Side note: this is one of the reasons Talossa is so fun. You can just dive into a task that you'd never have any reason to do otherwise, and learn some new skills on the way! I've gotten so much out of Talossa this way.
Okay, all downloaded! Zipping now and then I'll upload it. 11,710 threads in total. It'll take probably about a half hour or so to zip, since it's 6 gigs and 64,128 files. And then it'll take a while to upload. As you might guess, my computer and my internet connection aren't top of the line. But by morning, GV, you will be the proud owner of a genuine and reasonably complete backup of Wittenberg.
Quote from: Sir Alexandreu Davinescu on March 25, 2021, 07:22:22 PM
Side note: this is one of the reasons Talossa is so fun. You can just dive into a task that you'd never have any reason to do otherwise, and learn some new skills on the way! I've gotten so much out of Talossa this way.
You and me both, Alexander. Back in the day, I was doing horrible websites and through Talossa, I've kept up my writing skills and gotten a view onto the world I never, ever would have gotten otherwise.
Quote from: Sir Alexandreu Davinescu on March 25, 2021, 07:37:39 PM
Okay, all downloaded! Zipping now and then I'll upload it. 11,710 threads in total. It'll take probably about a half hour or so to zip, since it's 6 gigs and 64,128 files. And then it'll take a while to upload. As you might guess, my computer and my internet connection aren't top of the line. But by morning, GV, you will be the proud owner of a genuine and reasonably complete backup of Wittenberg.
Thank you!! What I will do is go through and make sure all threads are represented. This will not take too long.
As for internal file-linkage, I've always had the expectation people would have to do the searching for whatever on OldWitt by hand.
If the batch-editing messes with the original thread numbers, don't do the batch editing. The thread numbers are critical to future citation and research.
Quote from: Sir Alexandreu Davinescu on March 25, 2021, 07:37:39 PM
Okay, all downloaded! Zipping now and then I'll upload it. 11,710 threads in total. It'll take probably about a half hour or so to zip, since it's 6 gigs and 64,128 files. And then it'll take a while to upload. As you might guess, my computer and my internet connection aren't top of the line. But by morning, GV, you will be the proud owner of a genuine and reasonably complete backup of Wittenberg.
Tech seems to have caught up with our advanced needs in Talossa.
Quote from: GV on March 25, 2021, 08:16:35 PM
Thank you!! What I will do is go through and make sure all threads are represented. This will not take too long.
As for internal file-linkage, I've always had the expectation people would have to do the searching for whatever on OldWitt by hand.
If the batch-editing messes with the original thread numbers, don't do the batch editing. The thread numbers are critical to future citation and research.
11,710 threads is a lot to check, since you have to open each one manually. If you are able to open and check each thread at a rate of one every five seconds without ever stopping or slowing down, that's more than sixteen straight hours of checking! Unless there's some urgent need, I'd suggest it might save you a lot of time if you just gave me a little bit to sort it out so that it's navigable. I don't know if I'll be able to get fancy stuff like searching working, but I bet I can get it a lot better-sorted for you than the current state.
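The sixteen-hour figure can be checked with a quick calculation. The sketch below simply multiplies the thread count by the assumed five seconds per thread; the function name is illustrative only.

```python
def manual_check_hours(thread_count, seconds_per_thread):
    """Hours needed to open every thread once at a steady pace, with no breaks."""
    return thread_count * seconds_per_thread / 3600


if __name__ == "__main__":
    # 11,710 threads at 5 seconds each is 58,550 seconds.
    print(f"{manual_check_hours(11_710, 5):.2f} hours")  # prints "16.26 hours"
```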
In an hour it'll all be uploaded, compressed to 1 gig.
Restarting the upload because I figured out the batch edit thing pretty quickly. I think this should work on your computer now. Once you unzip the 7z file, then you want to open up the Witt2 folder, then the talossa.proboards.com folder. Inside of that folder is a file named index.html, and it should open normally with any web browser. You can navigate through any page that the crawler could access, and the links should be purely relative to their location and operate normally. In order to advance through multiple pages in a board with a large size, like the main one, you need to use the "next" link (or you can manually edit the URL to the intended number). If you click the ellipsis, the dialogue box to pick a page will come up, but it won't work. No external links were copied, so any links to other sites outside of talossa.proboards.com will be dead (but there will be an error page to point you to a live version, if one exists). No images were downloaded, but that'll be something I can try in the future (why not, after all?)
To be clear, I've only done spot checks here and there, but I didn't find any missing pages. If something's not working or missing, let me know and I'll see if I can figure it out.
EDIT: Uploaded and sorted. Enjoy.
Stunning. I will take a look at everything this coming week.
Working on grabbing images, but running into some problems with that amount of data. Working on it, but expect nothing in the near term. Seems pretty low priority anyway.
Everything work okay for you? I know you'd been wanting this for years now, so I hope that this whole thing wasn't anticlimactic.
Quote from: Sir Alexandreu Davinescu on April 05, 2021, 09:53:23 PM
Working on grabbing images, but running into some problems with that amount of data. Working on it, but expect nothing in the near term. Seems pretty low priority anyway.
Everything work okay for you? I know you'd been wanting this for years now, so I hope that this whole thing wasn't anticlimactic.
Argh. This got buried with other stuff I'm working on. Alexander, I'll let you know on this by the end of this month. Thanks already for an amazing amount of work on this.
What I will need to do is check to see if every thread number (save Chat) is covered. Once I'm sure we can open everything without proprietary software, I'll call this project done.
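A coverage check like this does not have to be done by opening each thread. Below is a minimal Python sketch under two assumptions that are not confirmed in this thread: that each saved thread appears directly under a "thread" folder of the extracted archive as either a numbered directory or a numbered .html file, and that a list of expected thread numbers is available from somewhere. The function names are invented for illustration.

```python
import re
import tempfile
from pathlib import Path


def saved_thread_ids(mirror_root):
    """Collect numeric thread IDs found directly under <mirror_root>/thread."""
    thread_dir = Path(mirror_root) / "thread"
    ids = set()
    if not thread_dir.is_dir():
        return ids
    for entry in thread_dir.iterdir():
        # "12345" (a directory) and "12345.html" (a file) both have stem "12345".
        if re.fullmatch(r"[0-9]+", entry.stem):
            ids.add(int(entry.stem))
    return ids


def missing_thread_ids(mirror_root, expected_ids):
    """Return the expected thread IDs that are absent from the mirror, sorted."""
    return sorted(set(expected_ids) - saved_thread_ids(mirror_root))


if __name__ == "__main__":
    # Demo on a throwaway folder: threads 1 and 3 exist, thread 2 does not.
    with tempfile.TemporaryDirectory() as demo_root:
        (Path(demo_root) / "thread" / "1").mkdir(parents=True)
        (Path(demo_root) / "thread" / "3.html").write_text("<html></html>")
        print(missing_thread_ids(demo_root, range(1, 4)))
```

Chat Room thread numbers would show up as missing in such a report, since they were never downloaded, so they would need to be left out of the expected list.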
No rush at all; I was just curious. Take your time and whenever you get around to it, let me know if it works or if something's broken.
Quote from: Sir Alexandreu Davinescu on April 05, 2021, 11:10:38 PM
No rush at all; I was just curious. Take your time and whenever you get around to it, let me know if it works or if something's broken.
Sounds good. Thanks again!
I haven't been able to get the images yet, but I'm still working on this. Just FYI.
Quote from: Sir Alexandreu Davinescu on April 20, 2021, 07:51:56 AM
I haven't been able to get the images yet, but I'm still working on this. Just FYI.
TY for your continuing efforts!
Doesn't look like I'll be able to grab the images, since I keep getting shut down by ProBoards. Did the existing archive work?
Quote from: GV on April 05, 2021, 10:28:55 PM
Argh. This got buried with other stuff I'm working on. Alexander, I'll let you know on this by the end of this month. Thanks already for an amazing amount of work on this.
Not to be a pest, but I spent a decent amount of time getting this set up and working, and I just want to verify that the archive did work okay. It's been a few months -- have you had a chance to check it?
Quote from: GV on April 20, 2021, 12:02:25 PM
TY for your continuing efforts!
Quote from: Baron Alexandreu Davinescu on July 28, 2021, 10:51:19 PM
Quote from: GV on April 05, 2021, 10:28:55 PM
Argh. This got buried with other stuff I'm working on. Alexander, I'll let you know on this by the end of this month. Thanks already for an amazing amount of work on this.
Not to be a pest, but I spent a decent amount of time getting this set up and working, and I just want to verify that the archive did work okay. It's been a few months -- have you had a chance to check it?
Maybe you missed this, @GV?
Extracting the near-one-gig compressed archive now...
Assuming it works fine (as it should with you having put it together), I will upload it to the Library of Talossa.
Quote from: GV on August 04, 2021, 10:55:31 PM
Extracting the near-one-gig compressed archive now...
Assuming it works fine (as it should with you having put it together), I will upload it to the Library of Talossa.
You can do that, but it won't really work or be navigable. I recommend that for public access you leave it as a compressed file that people can download to their own computers and extract as they so choose. Once it's on someone's computer, the local links should all work properly and they should be able to just click through as though it were a live site, but that's not possible on Google Drive. On Google Drive, it will just be a giant collection of unlabeled HTML files in different folders, and thus pretty much useless.
I just want to confirm that you are able to download it, extract it, and then navigate normally through all the different boards at different levels without any trouble or encountering any missing threads?
So in the last five months, were you able to download this archive, extract it, and then navigate normally without any trouble or any missing threads? You were working on this for years, and I'd still like to confirm that it actually works for you.
Quote from: GV on August 04, 2021, 10:55:31 PM
Assuming it works fine (as it should with you having put it together)
Did you ever have a chance to check to make sure that the archive extracts correctly, and that you can navigate through to all the forums and look at different threads?
I've been keeping my copy for a long while now in case I need to tweak it, but I don't have the space since it's so large. But I don't want to delete it if something is wrong with the uploaded copy. It's been more than half a year that I've been asking you this, GV -- please find the time to check this out and make sure it's working.
Okay, well, I guess I'm just going to delete this file and hope that the archive really does work for you, @GV. You've been on Witt multiple times over the last eight months, and I've tagged you several times, so pretty clearly you're just purposefully ignoring me -- and that's really rude. Especially considering how I busted my butt to do this for you and save you from manually saving each thread of the old forum, one at a time, like you'd been doing.
@GV, it's been a year or so. Did you ever get around to checking to see if the archive works for you?