Hi everyone.
I hope everyone is having a good day/night. There is a good reason I want to have an offline copy of this site. With all the wars and uncertainties going on around the world, it would make me unhappy to lose access to this site if the internet ever goes away permanently. Your website is wonderful and I enjoy going through it. I am the maintainer of a small Debian-based Linux distro and I use resources from this site and others. It would make things difficult if I lost access to this site. Can anyone tell me if it is possible to have an offline copy I could run on my home computer? Maybe a ZIM file.
My custom distro, in case anyone wants to use it, is below. I am currently working on a new version of the DVD, which will be based on Debian 12. A lot of goodies are coming.
https://kjv-endtimes.neocities.org/
Thanks in advance for any help.
Nomad
I would like this, but it has low feasibility with our current level of human resources. Each snapshot would have to be produced manually, and it can't just be a copy of the site files and database: we would not be able to share sensitive data like user account information, so such data would have to be filtered out.
Then again, I don't really know how Wikipedia handles it. If there are tools to facilitate that, I'm willing to listen. Just know that this site is not based on MediaWiki.
--Medicine Storm
If your site is based on PHP, then you could look at the source and make a detached copy without the user logins. It seems the downloadable files are ZIP, MID, MP3, etc. If the license for each file is stored on the file system, then that could be compressed and shared without needing the main site. It might be possible to write a bash script that combs through the files, builds individual HTML files, and links them into an index.html. I do not know what the site structure looks like, so I am only guessing, but maybe my suggestions help. Are you using a modified CMS? Thanks for replying. Be sure to check out my custom Linux. It has UFW installed and automatically configured to block all incoming connections. Enjoy.
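Something like this rough sketch is what I have in mind, assuming the assets sit under a directory like /var/assets (that path and the file layout are just guesses on my part, not how your server is actually arranged):

```bash
#!/usr/bin/env bash
# Build a flat index.html linking every asset file found under ASSET_ROOT.
# ASSET_ROOT and the file types are assumptions for illustration only.
set -euo pipefail

ASSET_ROOT="${1:-/var/assets}"
OUT="index.html"

{
  echo "<html><body><h1>Offline asset index</h1><ul>"
  # List common asset types; adjust the extensions as needed.
  find "$ASSET_ROOT" -type f \( -name '*.zip' -o -name '*.mp3' -o -name '*.mid' -o -name '*.ogg' -o -name '*.png' \) |
  sort |
  while read -r f; do
    rel="${f#"$ASSET_ROOT"/}"
    echo "  <li><a href=\"$rel\">$rel</a></li>"
  done
  echo "</ul></body></html>"
} > "$OUT"

echo "Wrote $OUT"
```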
Appreciate the shoutout! It's awesome to hear that you're finding our site helpful. I totally get where you're coming from with wanting an offline copy, especially with all the craziness going on. While I'm not sure about zim files specifically, have you checked out tools like HTTrack? They're handy for grabbing websites and saving them locally for offline browsing. And your distro sounds intriguing—can't wait to see what goodies you have in store with Debian 12!
Hi. I know about HTTrack. I do not want to disrespect the website owner by hitting his servers with so many requests, which is why I asked if he has an offline option I can download. I do not need the site contents, just the files along with the necessary license information (CC0, CC-BY, etc.). That way I could always search through the file system to find what I want. I am building a web scraper for Survivor Library because that site is straightforward to understand, and I use a method that checks all the files against what I have on disk, so I only download what is new, which tends to be very small. This tool will be on the next release of the DVD. The problem with the OpenGameArt site is that it runs a CMS, and that creates a problem. I have to wait and see what happens. At the very least I want to respect the owner.
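The check itself is the simple part. Roughly something like this, assuming I already have a plain list of direct download URLs in a urls.txt file (building that list is the part the CMS makes hard):

```bash
#!/usr/bin/env bash
# Download only the files whose names are not already present locally.
# urls.txt (one direct download URL per line) and DEST are assumptions for illustration.
set -euo pipefail

DEST="${1:-./mirror}"
mkdir -p "$DEST"

while read -r url; do
  name="$(basename "$url")"
  if [ -f "$DEST/$name" ]; then
    echo "skip  $name (already on disk)"
  else
    echo "fetch $name"
    wget --quiet -O "$DEST/$name" "$url"
    sleep 2   # be polite to the server between requests
  fi
done < urls.txt
```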
The respect is appreciated. Scraping the site would result in an automated IP restriction. I'll work with you in any way I can, though. Please let me know what sort of resources you would like and I'll see what is available.
--Medicine Storm
Hi. Let me explain who I am. I consider myself a person who believes in preserving data for the bad times to come. I know the internet is going to be taken down; it is only a matter of time with the evil leaders we have in this world. Preserving information so others can access it is very important. This is the reason I am making a tool (using the Lazarus IDE) to download PDF files from Survivor Library while showing respect to the owner of that site. Your site is complicated because it runs on a CMS. I am assuming you have the data files (MP3, ZIP, etc.) on the physical disk. If you are using Linux, then you could generate HTML files with links to those files on the file system, and I could download from there. This way I could check the HTML files, see which files I already have, and only download the ones I do not. The HTML files themselves should not be a hassle for the servers. It should be simple.
Having the files by themselves is incomplete. The files are unusable without the license and attribution that go with them. To preserve the information, you must preserve the files as well as the license, the author, and which other assets the files were derived from.
--Medicine Storm
Yes. That is what I said in one of my earlier comments. I need the license file (txt) with the content (mp3, mid, zip, etc.). Maybe you could group the HTML files by the user account that uploaded them. That way you could potentially preserve everything in place.
We have a way to access all the files, and we have a way to access the database information containing authorship, license, etc. But we don't have the resources to build scripts that arrange asset files and information .txt files together. However, the code for Drupal 7 is available here: https://www.drupal.org/download and the custom module code for OGA is available here: https://github.com/OpenGameArt/OGA2-Modules
If you create a tool you'd like to try, we'll take a look.
--Medicine Storm
I will try to make a bash script to see if I can do it in a minimal manner. The tools I make are for Linux only. I do not use Windows anymore because of security issues.
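Roughly, the minimal version I have in mind looks like this, assuming you could hand me a CSV export with path, title, author, and license columns (that export is hypothetical, I know it does not exist today). It copies each asset into its own folder next to a small credits.txt:

```bash
#!/usr/bin/env bash
# Pair each asset file with a credits.txt built from a CSV export.
# assets.csv with columns path,title,author,license is a hypothetical export,
# not something OGA currently provides. The CSV parsing is naive (no quoted commas).
set -euo pipefail

CSV="${1:-assets.csv}"
OUT="${2:-./archive}"

# Skip the header line, then read the four columns per row.
tail -n +2 "$CSV" | while IFS=',' read -r path title author license; do
  # Use the title as the folder name, replacing unsafe characters.
  dir="$OUT/$(printf '%s' "$title" | tr -c 'A-Za-z0-9._-' '_')"
  mkdir -p "$dir"
  cp "$path" "$dir/"
  {
    echo "Title:   $title"
    echo "Author:  $author"
    echo "License: $license"
  } > "$dir/credits.txt"
done
```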
Hi. I think I might have found a solution for you. You can try this: if it works, then you can have an exact copy of your site without the logins. That means you can zip it up and upload it to a site like archive.org; I host my custom Linux ISOs there. That could ease the burden on your server. I would set up a Linux VM running Drupal, make a backup copy of the site, and migrate it into the virtual machine, where you can do all your experimenting, just like the person in the article, without messing with your main site. This might be the solution, and it will retain your website branding so everyone will know it came from you.
https://drupal.stackexchange.com/questions/109156/duplicating-drupal-sit...
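For the "without the logins" part, on the cloned copy (never the live site) something like this should blank out the account data, assuming the stock Drupal 7 users and sessions tables; please double-check the column names against your actual schema before running anything like it:

```bash
#!/usr/bin/env bash
# Scrub personal account data from a CLONED Drupal 7 database (never the live one).
# The database name and the stock D7 column names are assumptions; verify against
# the real schema first.
set -euo pipefail

DB="oga_offline_copy"   # the clone, not production

mysql "$DB" <<'SQL'
-- Keep uid 0 (anonymous) and uid 1 (admin) intact if desired; blank out the rest.
UPDATE users
   SET pass = '', mail = '', init = ''
 WHERE uid > 1;
-- Drop any active sessions from the copy as well.
TRUNCATE TABLE sessions;
SQL
```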
I also feel that your site should be preserved on the Wayback Machine because it is an important resource.
Just throwing it out there that I asked this same question a few years ago and did end up making a backup, sort of. My method didn't work too well, though, and is obviously not a proper way to back up the website, but I'll explain what I did. https://opengameart.org/forumtopic/backing-up-the-entire-website
The first task was to get all of the submission page links. I separated these into the same categories that OGA uses such as 2D, 3D, Music, etc. I made a script that would go through each of the search result pages for each category and copy the submission page links. This is not a good way of doing it though as the search results are a bit weird. Some submissions showed up on multiple pages and it seems a few never showed up. It's like each generated page is cached differently.
The second phase was taking those submission page links and creating a script to scrape those pages. It would create a folder named after the submission, copy the submitter name, license, attribution, etc. to a text file in the folder, and finally copy the actual asset files into it. In order to save OGA some bandwidth, I ran the script against the Wayback Machine to download what it had archived. In the end it downloaded approximately 20% of all submissions from the Wayback Machine.
The third phase was downloading from OGA. MedicineStorm and I talked about it and decided the best option was to do a trickle download. I limited my download to 1 Megabit per second. Overall about 25GB was downloaded from the Wayback Machine and about 85GB was downloaded from OGA.
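For anyone wanting to repeat the trickle download, the throttle itself is just a wget option; 1 Megabit per second works out to roughly 125 KB/s (urls.txt stands in for whatever link list you build):

```bash
# Throttled bulk download: ~1 Mbit/s is about 125 KB/s, with a pause between files.
# urls.txt (one download URL per line) is a placeholder for your own link list.
wget --limit-rate=125k --wait=2 --input-file=urls.txt --directory-prefix=./downloads
```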
The fourth phase was cleaning up the disaster that I had just scraped. Because OGA submission titles allow characters that can't be used in folder names, I ran into issues where multiple submissions merged into the same folder. I also discovered that the HTML on the submission pages is inconsistent. Some older submissions have different HTML tags and such. This messed up the script in different ways, such as not downloading the files, wrong things parsed into the credits files, etc. Thankfully, many of these issues were something I could programmatically detect and then manually fix, or delete the folder and rerun an updated scraping script that would avoid the issue. If I remember correctly, there was one category, I think 2D images, that had about 500 submissions that ended up with a few HTML tags appearing next to the author names, and in some cases without an author name at all. I couldn't bring myself to fix that many manually, so I left them as is.
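If I redo it, the folder-name collisions could probably be avoided by appending a short hash of the original title, roughly like this:

```bash
#!/usr/bin/env bash
# Map a submission title to a unique, filesystem-safe folder name.
# Appending a short hash of the original title prevents two different titles
# from collapsing into the same sanitized folder name.
set -euo pipefail

safe_dir_name() {
  local title="$1"
  local cleaned hash
  cleaned="$(printf '%s' "$title" | tr -c 'A-Za-z0-9._-' '_')"
  hash="$(printf '%s' "$title" | sha256sum | cut -c1-8)"
  printf '%s_%s\n' "$cleaned" "$hash"
}

# Example: both titles sanitize to "Sword___Shield" but get different hashes.
safe_dir_name 'Sword & Shield'
safe_dir_name 'Sword + Shield'
```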
The archive itself is pretty handy to have. The way I structured the folders and text files makes it easy for most file explorers to parse keywords and show useful results. But with all the issues I encountered, I would be very hesitant to release the archive publicly. If I did, I would have to include a big warning that credits/files were scraped with a script and might be incorrect. Use at your own risk. With that being said, randomly checking submissions showed that they were correct when compared to OGA. Since the whole thing is around 110GB, it's not as big as I expected. But OGA has obviously had new submissions in the past 4 years since the archive was made so a newer archive would be bigger. So if OGA did go down in the near future, there is at least a partial backup that exists and could be shared online if needed. I would prefer that we work towards a more proper backup instead of my hacky mess.
Sounds like a plan. I have to see if the site owner is OK with that. Your username is that of a major Old Testament prophet. Cool name.
As Isaiah 65:8 said, we spoke about his plan prior to implementation. There is no issue with doing the same thing, as long as you place a reasonable throttle on it just as he did.
--Medicine Storm
I might not do the web scraper. The option of backing up the site offline and removing or disabling the user accounts could work. It is up to you.
Take a look at Survivor Library. I have all the books offline in PDF, organized by category. I am able to download the HTML contents of each category page, extract the PDF links, and download them into the relevant folder. The HTML pages have a small footprint. I do this every 2 weeks to see if he has put up anything new. If your site were organized like this, then I could build such a tool for your site.
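The per-category routine is roughly this, assuming a category page URL and a destination folder are given (the href pattern is a guess for illustration, not Survivor Library's exact markup):

```bash
#!/usr/bin/env bash
# Fetch one category page, pull out its PDF links, and download only the
# missing files into that category's folder. The page URL and the simple
# href pattern are assumptions for illustration.
set -euo pipefail

PAGE_URL="$1"            # e.g. a category listing page
DEST="$2"                # e.g. ./library/medicine
mkdir -p "$DEST"

curl -fsSL "$PAGE_URL" |
  grep -oE 'href="[^"]+\.pdf"' |
  sed -e 's/^href="//' -e 's/"$//' |
  while read -r link; do
    # Relative links would need to be resolved against PAGE_URL; kept simple here.
    wget --no-clobber --limit-rate=125k --directory-prefix="$DEST" "$link"
  done
```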
I still think having an offline version of your site would be wonderful, because you could upload it to archive.org and people could go there to get the downloads. That way your main site is not affected. It is up to you what you decide. I will not scrape your website because it is not set up like Survivor Library.
https://www.survivorlibrary.com/index.php/library-download
I recognize the "backup snapshot scrubbed of user data" is likely the preferred option, and one that does reduce resource costs for our sysadmin team. However, it does still require some work on the part of our sysadmins, so we may not be able to give that solution a try until much later. If time is of the essence, you may elect to try Isaiah's throttled webscraper option, since it doesn't require you to wait on us to have some free time. I'll present it to the sysadmins, but can't promise any sort of timeframe.
--Medicine Storm
I am a little bit busy right now working on the Debian DVD. I am porting all the apps from web to binaries and making new apps for the King James Bible. It is a lot of work, so I will not get a chance to work on a web scraper for your site right now. I also need to get some additional storage, since I do not know the size of your site, and I don't have the cash for that at the moment. I guess by the time I get the external storage, the sysadmins might have figured out a solution to this problem. I will say this in closing: this site should be preserved on the Wayback Machine because of its uniqueness and its important role in the open source community. It is a valuable treasure.
hey
I am wondering whether the Microsoft update bug has opened many people's eyes to the fragility of the internet. I think the sysadmins should look into making the site available offline, because things will get worse. I think the sysadmins should talk to Kiwix about making an offline version of the site. I am sure they would assist and host it for you.
https://kiwix.org/en/
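Kiwix's openZIM project publishes a crawler called zimit that turns a website into a .zim file. A rough invocation looks something like the following, though the flag names change between versions (for example --url versus --seeds), so check their README before trusting this; the output name is just a placeholder:

```bash
# Rough sketch only: zimit is openZIM's website-to-ZIM crawler, usually run via Docker.
# Verify the flag names against the zimit documentation for the version you pull.
docker run -v "$PWD/output:/output" ghcr.io/openzim/zimit zimit \
  --url https://opengameart.org/ \
  --name opengameart_offline
```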
Look at what they have available. OpenGameArt.org could be on the list.
https://library.kiwix.org/#lang=eng