Last update: 2002.08.12
-------------------------------------------------------------------
Q. There's a site that I want to grab, but it has a bit of Flash that's getting in the way. Will Flash ever be supported in the future?
A. Well, Flash is:
- a proprietary format
- a binary format
The first point is not so important for crawling, but the second is. Parsing a binary format, AND modifying it to fit the local copy, is a nightmare to do. I don't think Flash parsing will be implemented soon, and I don't even know if it will be done in the future. The W3C has standardized a much more open and rather nice vector format, SVG; it cannot yet handle media features (sound, interaction), but wait and see..
-------------------------------------------------------------------
Q. Why does HTTrack spend so much time examining ASP and PHP pages?
A. It has to scan them to 'know' their type. Use the 'Options'/'MIME Types' panel to speed this up: declare something like "php,asp" <-> "text/html" - the engine will no longer scan these files, and will consider them as plain HTML.
-------------------------------------------------------------------
Q. When downloads are interrupted by line disconnections, I choose "Continue interrupted download" and it checks all the files and updates the existing ones too. It slows things down tremendously, and I cannot get it to just continue where it left off.
A. First, it is perfectly normal for HTTrack to scan the local file structure, but this should be quite fast (unless you downloaded thousands of pages). Ensure that you cleanly interrupted the mirror after the line went down: select Cancel and WAIT for pending connections to close (or break them using the SKIP buttons). This allows files downloaded in the background to be taken into account.
-------------------------------------------------------------------
Q. I want to grab a site with CGI. Imagine the site is http://www.foo.com/. You need to log in with a name and password on a page like http://www.foo.com/login/login.cgi. After logging in, I choose a category with some messages that I want to grab. The address is something like http://www.foo.com/docs/display.cgi?category=bar. I can download the messages one by one by opening each file and saving it to disk, but I can't grab the site as a whole.
A. The problem is that you can 'capture' the first login page through the 'Capture URL' feature of (Win)HTTrack, but you won't be able to automatically select forms, buttons or choices on the next page. This would require some automated scripting, and it isn't implemented yet.
-------------------------------------------------------------------
Q. Is there any way to use the projects created in WinHTTrack on the command line with httrack.exe? Or perhaps a way to convert the Windows project to a batch script?
A. Yes. To continue or update a project, just run 'httrack --continue' or 'httrack --update'. You can also copy/paste the options recorded in hts-cache/doit.log (the first line), as sketched below.
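For instance - a minimal sketch, assuming the WinHTTrack project folder is "C:\My Web Sites\example" (the path and project name are only illustrative) - from a Windows command prompt:

   cd "C:\My Web Sites\example"
   httrack --update

   rem the exact options WinHTTrack used are on the first line of this log:
   type hts-cache\doit.log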
-------------------------------------------------------------------
Q. Can I set the minimum and maximum size of files to download?
A. This is generally NOT a good idea for ALL files, but you can use filters, like in:
-*.gif*[<5]                to exclude all gif files smaller than 5KB
-*.gif*[<5] -*.zip*[<30]   to exclude all gif files smaller than 5KB and all zip files smaller than 30KB
-**[<5]                    to exclude ALL files smaller than 5KB (EVEN html files with links inside! Warning!)
-*.avi*[>1000]             to exclude all avi files larger than 1000KB
-*.jpg +*.jpg*[<500>100]   to exclude all jpg images, except those with a size between 100KB and 500KB
-------------------------------------------------------------------
Q. How does HTTrack check for updates in a site? Does it download the entire file, or just check the size or something?
A. It rescans the whole structure, sending update requests to the remote server. When modifications are detected, new files are transferred to replace the previous ones, and the structure is rebuilt according to the new remote files. Update requests are faster than regular requests, so this process is generally much faster than the first download. Note, however, that some servers are unable to respond properly to update requests and ALWAYS send "fresh" data, wasting more bandwidth than necessary. Unfortunately, this is impossible to avoid: update handling is done remotely, and the client cannot control everything.
-------------------------------------------------------------------
Q. I'm worried that these dynamic pages will be redownloaded the next time I update the mirror, even though there are no changes in the pages themselves.
A. The way httrack saves/renames these pages *locally* does not change the way httrack does updates, and does not influence the update process at all. The original remote hostname, filename AND query string are stored in the hts-cache/ data files, and httrack only uses this information to perform the update.
But in fact, most of the update work is handled by the remote server, through two important steps:
- during the first download, the server has to send a reliable way to tag the file/URL, such as a timestamp (current date+time) or, even better, a strong ETag identifier (which can be an MD5 hash of the content - the "ultimate weapon" for handling updates). This information identifies the "freshness" of the data being sent.
- during the update, httrack requests the previously downloaded file, giving the server the "hint" previously sent (timestamp and/or ETag). It is then the duty of the server to respond either with an "OK, file not modified" message (304) or with an "OOPS, you have to redownload this file" message (200). (See the example exchange below.)
With this system, the caching process is totally transparent, and very reliable. That's the theory. Now let's go back to the real world.. Some servers, unfortunately, are really dumb: they just ignore the timestamp/ETag, or do not give any reliable information the first time. Because of that, (offline) browsers like httrack are forced to re-download data that is identical to the previous version. Even clever servers are sometimes unable to compensate for careless scripts that just don't care about bandwidth waste and caching problems. Because of that, many websites (especially those with "dynamic" pages) are not "cache compliant", and browsers will always re-download their data. This is not something a browser can change - only servers could, if webmasters were concerned about caching problems. (For information, there are ALWAYS methods that allow pages to be cached, even dynamic ones, and even those using cookies and other session-related data.)
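To illustrate the exchange described above, here is a rough sketch using curl rather than HTTrack (the URL, date and header values are made up for the example):

   curl -sI http://www.example.com/page.php
      (the first reply carries validators such as "Last-Modified: Mon, 12 Aug 2002 10:00:00 GMT" and/or an "ETag:" value)

   curl -sI -H "If-Modified-Since: Mon, 12 Aug 2002 10:00:00 GMT" http://www.example.com/page.php
      (a cache-friendly server answers "HTTP/1.1 304 Not Modified" and sends no body; a server that ignores the header answers "200 OK" and resends the whole page - the ETag works the same way, via an "If-None-Match:" header)
-------------------------------------------------------------------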