[HacktionLab] Static archive of site - the best command

Micah+HacktionLab at ourmayday.org.uk
Tue Jul 18 11:47:00 UTC 2023


HacktionLabers,

If you know a website is going to change significantly or disappear
from the web, then it's worth installing the Wayback Machine browser plugin
https://addons.mozilla.org/en-GB/firefox/addon/wayback-machine_new/
https://microsoftedge.microsoft.com/addons/detail/wayback-machine/kjmickeoogghaimmomagaghnogelpcpn
https://chrome.google.com/webstore/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak 


which you can set to trigger saves to web.archive.org as you browse the
site. But be warned: don't leave this feature on permanently, as it's a
security as well as a privacy risk when you then browse to sites that
store credentials in URLs; yes, some still do this.

There are also other browser plugins that save to archives such as
archive.is
<https://web.archive.org/web/20220814004816/https://archive.is/> in
addition to the Wayback Machine.
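You can also trigger one-off snapshots without a plugin, via the Wayback
Machine's "Save Page Now" endpoint (https://web.archive.org/save/<url>).
A minimal sketch, assuming example.org stands in for the page you want
archived:

```shell
# Build the "Save Page Now" URL for a target page.
# example.org is a placeholder -- substitute the real page.
TARGET="https://example.org/"
SAVE_URL="https://web.archive.org/save/${TARGET}"
echo "$SAVE_URL"
# A real run needs network access, e.g.:
# curl -s -o /dev/null -w '%{http_code}\n' "$SAVE_URL"
```

This only asks the Wayback Machine to crawl one page, so you'd still
need the plugin (or a loop over URLs) to capture a whole site.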


If anyone wants to convert a WordPress site into static pages using
Hugo (https://gohugo.io/), I can share my experience of doing this.

You can use a script or a plugin to export from WP as Markdown and config
ready for Hugo, but there is also a way to convert WP database backups.
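For anyone collecting the wget flags mentioned elsewhere in this thread
into one command, here is a sketch (example.org is a placeholder domain,
not a real target):

```shell
# Assemble a site-mirroring command from the flags discussed in this
# thread: full mirror, fetch page assets, rewrite links for offline
# browsing, fix file extensions, rate-limit, ignore robots.txt, resume
# on restart, and use a browser-ish user agent to get past UA blocking.
WGET_CMD="wget --mirror --page-requisites --convert-links --adjust-extension \
  --wait 1 -e robots=off -c --user-agent=Mozilla https://example.org/"
echo "$WGET_CMD"
# A real run needs wget and network access; output lands in a directory
# named after the host (./example.org/).
```

Note that --convert-links rewrites pages only after the whole crawl
finishes, so let the command run to completion before checking the result.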

cheers

Micah aka sb

https://J12.org/sb/

On 18/07/2023 11:59, Charlie Harvey wrote:
> Hi,
>
> For the sake of completeness, here are some other wget params that can
> be useful:
>
> --wait 1  to put a delay between page fetches (if killing your server
> may be an issue)
>
> -e robots=off to ignore robots.txt
>
> -c  to continue if you get halfway through and need to restart
>
> --user-agent="Mozilla"  if the site has cloudflare in front of it (they
> block wget, curl et al by their UA name)
>
> Cheers,
>
> On 18/07/2023 11:14, Mike Harris wrote:
>> Thanks for the suggestions all.  I will try the wget command first as
>> my need is to set up a new WP site for them, whilst providing a static
>> archive of their original site, and then they can link to their old docs.
>>
>> The original site was built using some bespoke hosting company’s thing,
>> called “Webs” or similar, they then got bought by VistaPrint, and then
>> some of the little (maverick) sites like this one (a district
>> association of allotment associations) have been told their sites are
>> going dark with (apparently) no offer of an archive of their site or
>> anything … grrrr >:-(
>>
>>
>>
>> Mike Harris
>>
>> XtreamLab
>> W:https://XtreamLab.net
>> T: +44 7811 671 893
>>
>>> On 18 Jul 2023, at 10:22, Nick Sellen <hacktionlab at nicksellen.co.uk>
>>> wrote:
>>>
>>> 
>>> Also worth a mention of the webarch service to do this -->
>>> https://archived.website/  (which uses httrack
>>> https://www.webarchitects.coop/archiving)
>>>
>>> ------- Original Message -------
>>> On Tuesday, July 18th, 2023 at 08:57, m3shrom<m3shrom at riseup.net>  wrote:
>>>
>>>> This has some good content
>>>>
>>>> https://www.stevenmaude.co.uk/posts/archiving-a-wordpress-site-with-wget-and-hosting-for-free
>>>>
>>>> It's focused on wordpress but potentially relevant for other content.
>>>>
>>>> Sample command I used for a wp network.
>>>>
>>>> wget --page-requisites --convert-links --adjust-extension --mirror
>>>> --span-hosts
>>>> --domains=mcrblogs.co.uk,www.mcrblogs.co.uk,edlab.org.uk
>>>> mcrblogs.co.uk/afrocats
>>>>
>>>> nice one
>>>> mick
>>>>
>>>> On 17/07/2023 23:23, Mike Harris wrote:
>>>>> Hi all, but especially Mick,
>>>>>
>>>>> Last year Mick gave a talk on recovering the old Schnews website and producing a static version of it by a certain clever use of curl or wget.
>>>>>
>>>>> What’s the best command to get a complete functional static version of the entirety of a website for all linked to content?
>>>>>
>>>>> I ask because I need to grab a site for someone that’s about to ‘go dark’ and no one can get the details to login and get to the file system side of things.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mike.
>>>>>
>>>>> Mike Harris
>>>>>
>>>>> XtreamLab
>>>>> W:https://XtreamLab.net
>>>>> T: +44 7811 671 893
>>>>> _______________________________________________
>>>>> HacktionLab mailing list
>>>>> HacktionLab at lists.aktivix.org
>>>>> https://lists.aktivix.org/mailman/listinfo/hacktionlab
>>>
-- 
https://j12.org/micah/