[HacktionLab] Static archive of site - the best command

Mike Harris mike at mbharris.co.uk
Mon Jul 24 07:21:52 UTC 2023


Hi all,

Thanks for the input. I wanted to use the command line so that I could 
pull the files directly onto my web server, and I ended up with the 
following:

wget --page-requisites --convert-links --adjust-extension --mirror \
  --span-hosts \
  --domains=static.websimages.com,thumbs.webs.com,images.webs.com,images.freewebs.com,odfaa.com,mediaprocessor.websimages.com \
  -e robots=off --user-agent="Mozilla" --wait 1 -E -H -k -K -p odfaa.com

The "-E -H -k -K -p" flags were useful for getting all the linked 
content pulled down and the links converted to local ones.
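
A note on those short flags, offered as a rough sketch rather than a 
tested recipe. Going by the GNU wget manual, -E, -H, -k and -p are the 
short forms of --adjust-extension, --span-hosts, --convert-links and 
--page-requisites, which are already in the command, so -K 
(--backup-converted, which keeps a .orig copy of each file before its 
links are rewritten) is the only one that adds anything. The script 
below spells each option out once in long form. The variable names 
"site" and "domains" are only illustrative, and the script prints the 
command for review instead of running it.

```shell
#!/bin/sh
# Sketch only: the same options as the command above, each given once in long form.
# Assumes GNU wget. The variable names are illustrative, not from the thread.
#
# Short-to-long mapping:
#   -p  --page-requisites    -k  --convert-links
#   -K  --backup-converted   -E  --adjust-extension
#   -H  --span-hosts

site="odfaa.com"
domains="static.websimages.com,thumbs.webs.com,images.webs.com,images.freewebs.com,odfaa.com,mediaprocessor.websimages.com"

# Build the argument list in the positional parameters so quoting is preserved.
set -- \
  --mirror \
  --page-requisites \
  --convert-links \
  --backup-converted \
  --adjust-extension \
  --span-hosts \
  --domains="$domains" \
  -e robots=off \
  --user-agent="Mozilla" \
  --wait 1 \
  "$site"

# Print the command for review; remove the leading 'echo' to actually run it.
echo wget "$@"
```

Adding -c to the list, as suggested further down the thread, should let 
an interrupted run be restarted without fetching everything again.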

If you're interested in federations of allotment associations, which I 
know you are, then you can see the results at:

https://mbharris.co.uk/odfaa/odfaa.com/

It still needs some work, I think, but it's mostly there.

Cheers,


Mike.

On 18/07/2023 11:59, Charlie Harvey wrote:
> Hi,
>
> For the sake of completeness, here are some other wget params that can
> be useful:
>
> --wait 1  to put a delay between page fetches (if killing your server
> may be an issue)
>
> -e robots=off to ignore robots.txt
>
> -c  to continue if you get halfway through and need to restart
>
> --user-agent="Mozilla"  if the site has cloudflare in front of it (they
> block wget, curl et al by their UA name)
>
> Cheers,
>
> On 18/07/2023 11:14, Mike Harris wrote:
>> Thanks for the suggestions all.  I will try the wget command first as
>> my need is to set up a new WP site for them, whilst providing a static
>> archive of their original site, and then they can link to their old docs.
>>
>> The original site was built using some bespoke hosting company’s thing,
>> called “Webs” or similar, they then got bought by VistaPrint, and then
>> some of the little (maverick) sites like this one (a district
>> association of allotment associations) have been told their sites are
>> going dark with (apparently) no offer of an archive of their site or
>> anything … grrrr >:-(
>>
>>
>>
>> Mike Harris
>>
>> XtreamLab
>> W: https://XtreamLab.net
>> T: +44 7811 671 893
>>
>>> On 18 Jul 2023, at 10:22, Nick Sellen <hacktionlab at nicksellen.co.uk>
>>> wrote:
>>>
>>> 
>>> Also worth a mention of the webarch service to do this -->
>>> https://archived.website/ (which uses httrack
>>> https://www.webarchitects.coop/archiving)
>>>
>>> ------- Original Message -------
>>> On Tuesday, July 18th, 2023 at 08:57, m3shrom <m3shrom at riseup.net> wrote:
>>>
>>>> This has some good content
>>>>
>>>> https://www.stevenmaude.co.uk/posts/archiving-a-wordpress-site-with-wget-and-hosting-for-free
>>>>
>>>> It's focused on wordpress but potentially relevant for other content.
>>>>
>>>> Sample command I used for a wp network.
>>>>
>>>> wget --page-requisites --convert-links --adjust-extension --mirror \
>>>>   --span-hosts \
>>>>   --domains=mcrblogs.co.uk,www.mcrblogs.co.uk,edlab.org.uk \
>>>>   mcrblogs.co.uk/afrocats
>>>>
>>>> nice one
>>>> mick
>>>>
>>>> On 17/07/2023 23:23, Mike Harris wrote:
>>>>> Hi all, but especially Mick,
>>>>>
>>>>> Last year Mick gave a talk on recovering the old Schnews website and producing a static version of it by a certain clever use of curl or wget.
>>>>>
>>>>> What’s the best command to get a complete functional static version of the entirety of a website for all linked to content?
>>>>>
>>>>> I ask because I need to grab a site for someone that’s about to ‘go dark’ and no one can get the details to log in and get to the file system side of things.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mike.
>>>>>
>>>>> Mike Harris
>>>>>
>>>>> XtreamLab
>>>>> W: https://XtreamLab.net
>>>>> T: +44 7811 671 893
>>>>> _______________________________________________
>>>>> HacktionLab mailing list
>>>>> HacktionLab at lists.aktivix.org
>>>>> https://lists.aktivix.org/mailman/listinfo/hacktionlab

-- 
---
Mike Harris

Email: mike at mbharris.co.uk
Web: https://mbharris.co.uk . https://xtreamlab.net . https://elsevier.com
In: https://www.linkedin.com/in/mbharris/
GitHub: https://github.com/mikebharris/



