Version 1.2.0-SNAPSHOT New features - Support (offline copies of) online-only openbooks such as the new Java 8 version of "Java Insel", 12th edition (closes #33). The new book edition is named JAVA_8 and comes with a special hint about the need to manually crawl the publisher's website (e.g. via HTTrack) and manually create a ZIP archive to be processed by GOC. - Verbose error & info message in case of MD5 mismatch (closes #25) - Remove links to online feedback forms (not just the forms themselves) from 29 books in 4,271 files. Inline feedback forms are now only left in 2 books (SHELL_PROG and VB_2008) and 191 files, in the other books they have been replaced by the publisher. Changed features - Several changes to support publisher's re-branding from Galileo to Rheinwerk - Add book WINDOWS_SERVER_2012, remove unavailable book PYTHON - Enhancement: handle truncated (partly downloaded) ZIP file gracefully. Throw exception with message: "Probably the download of was interrupted. Please delete the file and retry." - Bugfix: handle illegal book ID correctly - Bugfix: titles of Java_Insel and Javascript books were swapped in config.xml Internal changes - Improve error handling/logging for failed file system operations such as creating directories or renaming files - Recognise titles in meta tags and not just in explicit title tags - Recognise "Rheinwerk" in titles addition to "Galileo" - Automatically determine the HTML character set via jsoup instead of setting a fixed charset, because the "Java Insel" book for Java 8 uses UTF-8 instead of ISO-8859-1 or windows-1252. The output stream's charset is adjusted accordingly *after* the document was parsed. - Upgrade Java/AspectJ source/target levels to 1.8. - Upgrade AspectJ 1.8.13 and AspectJ Maven 1.11 - Change project directory layout to Maven default - Switch to com.jolira Fork or One-JAR plugin available on Maven Central because the original org.dstovall plugin repository is no longer available Version 1.1.0 New features - New command-line option "-w|--write-config" writes an editable book list file config.xml to current the current directory. This is an easy way for JAR users to extract the embedded default config.xml to disk and edit it subsequently, e.g. in order to add new books not contained in the upstream version. This way they do not have to wait for me to update the Git repo and the downloadable JAR. User self-service. :-) - New command-line option "-c|--check-avail" checks Galileo homepage for available books, comparing them with known ones, cf. "internal changes" section about AvailableBooksChecker. - New command-line option "-m|--check-md5" downloads all known books without storing them, verifying their MD5 checksums (slow! >1 Gb download), cf. "internal changes" section about DownloadChecker. - The CLI options "--check-avail" and "--check-md5" mentioned above can be combined by experienced users or the GOC author to detect changed download packages and re-test them, new books, deleted books and so forth. - In verbose logging mode, the removal of feedback forms is now logged. - New books APPS_IPHONE_IOS6, VB_2012_EINSTIEG. I detected the books using the new tool AvailableBooksChecker. :-) Changed features - Update MD5 for VCSHARP_2012 once again - Rename book APPS_IPHONE to APPS_IPHONE_IOS5 to differentiate it from the new *_IOS6 book Internal changes - Add XStream 1.4.4 as a new external library: XStream is a simple library to serialize objects to XML and back again, see http://xstream.codehaus.org. It is used in this project for loading/saving openbook meta data from/to a config file (config.xml, see below). - Make book list configurable: class Book used to be an enum with hard-coded items. Now it is a simple class containing String values which are read from an XML configuration file. In order to make this work in the IDE and as well as in JAR files, the file config.xml (which was created and is parsed by XStream) is searched in three places in this order: 1. ./config.xml If this local config file is found in either scenario (IDE or JAR) it will be preferred to other locations, which enables the user to override the default version. 2. resource/config.xml This is where the default file is located and found in Eclipse if no local config file exists. From there it will be maintained in Git and, because 'resource' is configured as an Eclipse source folder, added to runnable JARs when creating them from the IDE. 3. /config.xml This is where the default file is located and found in the JAR if no local config file exists. From there it will be read as a resource via Class.getResourceAsStream. - Project can be built with Maven now from Eclipse (added Maven nature to Eclipse settings) as well as from cloud services like * BuildHive: https://buildhive.cloudbees.com/job/kriegaex/job/Galileo-Openbook-Cleaner/ * CloudBees: https://scrum-master.ci.cloudbees.com/job/galileo-openbook-cleaner/ Both of them are in use at the moment, but after a few improvements to the CloudBees version I will probably delete the BuildHive version again because it must copy build artifacts on my personal FTP server while the CloudBees version features its own deployment repository. - Use one-jar Maven plugin to create a JAR of JARs similar to what we had before when creating a "runnable JAR" from Eclipse. - New tool class DownloadChecker checks downloads and verifies their MD5. The class has a 'main' method which can be called by advanced users who know it. Attention: This is a long-running operation (~14 minutes on my old PC using 16 Mbit/s downstream DSL) which as of 2013-03-10 downloads at least 1.1 Gb of data! - New tool class AvailableBooksChecker compares predefined to online books: This class has a 'main' method to be used by advanced users knowing how to interpret the output. Basically two sets of book URLs are compared: the list of known books from config.xml to the lists of books found online on the Galileo Computing and Galileo Design web sites. For each set the orphans, i.e. elements not found in the respective other set, are determined and printed. This helps to discover new, renamed or deleted books. It also helps to spot known books which are still available for download, but no longer listed on the web site. - Upgrade to jsoup-1.7.2 - Refactor JsoupFilter for cleaner code - Refactor LoggingAspect several times for cleaner code and thread safety, trying different approaches until as satisfactory solution was found. - JsoupFiter: update JavaDoc info for getFeedbackFormNeigbourhood - Eclipse .project: remove obsolete reference to JTidy project - FileDownloader: calculate MD5 while reading input, not writing output. This is necessary for making the write-to-file step optional (e.g. for MD5 check only). The constructor now accepts a null value for 'to' and sets the new boolean member 'doWriteToFile' accordingly. The latter is checked later during download. Version 1.0.3 New features - New book UBUNTU_12_04 ("Ubuntu 12.04 'Precise Pangolin'") - New book VCSHARP_2012 ("Visual C# 2012") Changed features - Update MD5 for VCSHARP_2012 and JAVA_INSEL: Galileo Press have changed the download files. Maybe they fixed some errata. Bugfixes - If there are download problems, no more files with size 0 bytes will be created anymore (cf. "internal changes"). - Make sure to keep non-element nodes: When moving the main content of the nested "grey table" directly under the BODY tag and removing the previously surrounding clutter, only element nodes were moved, but other content (such as dangling text nodes) were ignored and thus lost. This affected content in several books in different degrees: actionscript_1_und_2, actionscript_einstieg, apps_iphone, asp_net, dreamweaver_8, excel_2007, it_handbuch, java_7, java_insel, javascript_ajax, joomla_1_5, linux, linux_unix_prog, microsoft_netzwerk, php_pear, python, ruby_on_rails_2, shell_prog, ubuntu_11_04, ubuntu_12_04, unix_guru, vcsharp_2010 Internal changes - Initialise output after input stream for downloads (avoids 0-byte file): In class FileDownloader, FileOutputStream 'outStream' is now only initialised *after* the InputStream on the URL to be downloaded has been opened successfully. This avoids the creation of 0-byte files in case of server, proxy or other connection problems. Version 1.0.2 New feature - Add HTTP proxy support incl. authentication if necessary: JVM parameters for host, port, user, password are supported and can be specified on the command line e.g. like this (without line feeds): java -Dhttp.proxyHost=localhost -Dhttp.proxyPort=8080 -Dhttp.proxyUser=kriegaex -Dhttp.proxyPassword="test XYZ" Bugfixes - Make debugging easier by de-obfuscating callstacks in FileDownloader (cf. "internal changes") Internal changes - FileDownloader: catch all Exceptions in 'finally' block. Otherwise debugging in cases where 'in' or 'out' are null gets difficult because then the application will exit with a NullPointerException thrown from the nested 'try' block inside 'finally' rather than the real exception that happened before. Version 1.0.1 New features - New book ASP_NET ("Einstieg in ASP.NET") Version 1.0 New features - The souce code is now explicitly licenced under GPLv3 (see file COPYING). - Openbook conversion is now much faster because the switch to jsoup (see "internal changes"). - For the same reason the runnable JAR file is now smaller, containing fewer libraries. - Not a new feature, but a new requirement for developers wishing to edit the code or build the program from source: The Eclipse project is no longer a plain Java project, but an AspectJ project because I now use AOP (aspect-oriented programming) techniques for cross-cutting concerns like timing and logging to keep the main application code cleaner and more readable, no longer cluttering it with calls to functionality outside the main scope of cleaning openbooks. - Some minor bugs in page titles and layout (e.g. font sizes in "UNIX guru" book, garbled keyboard symbols in "Excel 2007" book [appendix A, "Tastenkombinationen"]) were fixed. Changed features - CLI (command line interface) * Option '-a' replaced by magic book ID 'all' which can be used instead * Options: show error message + usage info on missing book ID * Options '-d'/'-v' replaced by '-l|--log-level=(0|1|2)' with a default of 0 (normal). 1 is verbose, 2 is debug. * New log level 3 (trace) activates a call trace output on System.err (lower log levels still go to System.out). This is implemented via AspectJ and prints constructor and method signatures. So the trace output is very technical, but a nice toy if you want to avoid debugging the source code in an IDE, but still be able to see what is going on internally. * download_dir is no longer a positional parameter, but a named option '-d|--download-dir' with default '.' (current directory) * Option '-s' replaced by '-t|--threading=(0|1)' with a default of 1 (multi) * Option '-n' replaced by '-p|--pretty-print=(0|1)' with a default of 1 (pretty). This avoids the negative flag "no pretty-print". Later this option was removed altogether because the old filter chain was removed and jsoup introduced. So the flag became redundant because jsoup always pretty-prints. * Option '-?' can also be written as '--help' now Internal changes - Rename BookInfo to just Book and add member 'title' containing a book title to be used on the TOC page (index.htm*). - Temporarily add JCommander 1.26 as CLI parsing tool. Later upgraded to 1.27 and then removed again because of the switch to JOpt Simple. JCommander ist still available in branch cli_jcommander at the time of writing this (commit 0050d681). - Add JOpt Simple 4.3 as CLI parsing tool in class Options. This makes class OpenbookCleaner smaller and cleaner, moving constant USAGE_TEXT and most content of methods processArgs and displayUsageAndExit to Options. Static config variables are also in Options now. So CLI and option handling is nicely encapsulated now. - SimpleLogger: add thread-safe in-/dedent functionality. Thread-safe means: variables indentLevel and indentText are InheritableThreadLocal, not static. - SimpleLogger: optionally log thread ID (static flag LOG_THREAD_ID). If the flag is true, the log is written in tab-separated format (two columns: thread ID and indentation + log message) which can be imported into Excel easily via copy & paste. Then just add a header row, convert into table layout and there you go: filtering, sorting etc. are at your command, making it easy to see what happens in multi-threaded mode. - Start using AspectJ for logging and timing * Converted Eclipse project from Java to AspectJ * New TimingAspect is the stopwatch for the main loop and per book * New LoggingAspect takes care of indented echo/verbose output in classes OpenbookCleaner and Downloader. * Other classes like the filters still do their logging within the application code because they rather use intrinsic, fine-granular logging (several outputs per method) than wrapping comments ("entering", "exiting") which can easily be outsourced to AOP around() advice. * New enum SimpleLogger.IndentMode makes it possible to print and in-/dedent before/after printing in one method call. This helps to keep AOP around() advice code a bit shorter. * Corrected some minor indenting issues in filter classes - Downloader: rename & refactor moveDirectoryContents, closing a to-do: Now for books which are unzipped one directory level too deep it is no longer necessary to rename all files one by one, but the whole subdirectory is moved in three steps: 1) move download_dir/book_id/subdir -> download_dir/book_id.tmp 2) delete empty download_dir/book_id 3) move download_dir/book_id.tmp -> download_dir/book_id - Eclipse: For being able to browse javadocs for external libs offline, I changed all javadoc links to point to folders on my personal hard drive. This is not as interoperable as having them point to online URLs, but it works offline for me. Version 0.9.1.1 Changed features - Openbook Cleaner now runs under Java 6, Java 7 is no longer required Internal changes - Switch zip file handling to Apache Commons Compress 1.4.1. Thanks to user Christian Feneberg for pointing out that Java 7 on MacOS ist still in preview stage and thus not preinstalled on most Macs. Version 0.9.1 New features - CLI (command line interface) * Accept multiple book IDs or '-a' (all books) on command line. This makes convenience class AllOpenbooksCleaner with its own main method obsolete. * New option '-n' skips final pretty-printing after clean up. This optionally saves around 15% processing time, but the final result will not be pretty-printed by JTidy, it remains as written by XOM (which produces correct, but not nicely formatted XHTML). Changed features - Java 7 is now required because of the vcsharp_2008 unpacking bugfix (see below) Bugfixes - Book vcsharp_2008 could not be unpacked because it contains two files with special characters (German umlauts). This was due to a character encoding of "Cp437" while Java always expects UTF-8. Thanks to user Dirk Höpfner for notifying me about the bug. Internal changes - Option parsing: simplify loop and make it more readable - Option parsing: add log output for book_id arguments - Create separate methods for single- and multi-threaded conversion - ZipInputStream now explicitly uses "Cp437" encoding, a new feature available only in Java 7 after years of waiting. - BasicFilter: turn member debugLogMessage into method getDebugLogMessage. This removes the constructor parameter debugLogMessage from BasicFilter, making the constructor signature consistent with that of its subclasses. Also update all concrete subclasses so as to implement the abstract method and to use the changed constructor. - File extensions moved from application class OpenbookCleaner to static member FILE_EXTENSION in all concrete BasicFilter subclasses. - Rename SINGLE_THREADED_WITH_INTERMEDIATE_FILES to MULTI_THREADED. This also means that the logic is reversed now, but still multi-threaded mode is the default, so there is no change in behaviour. - Encapsulate filtering & threading in new class FilterChain. The new class iterates over a Queue of class objects (not instances!) and dynamically sets up the conversion/filtering chain, instantiating concrete BasicFilter subclass objects via reflection. This works in single- as well as multi-threaded mode, as usual using intermediate files for the former and piped streams for the latter. This makes class OpenbookCleaner cleaner, not violating the single purpose principle anymore. Only a nicely readable version of method 'convert' remains, methods 'convertSingleThreaded' and 'convertMultiThreaded' are gone. Furthermore, 'convert' now takes File arguments and no longer File*Stream ones, because FilterChain creates its own streams. Version 0.9 New features - Download, MD5-check and unpack books automatically before filtering - Download hi-res book cover image for each book (can be used as covers for e-books) - Page titles are filtered so as to remove book names and get clean chapter numbers and names. E.g. something like "Galileo Computing :: Book title - 1.2.3 Chapter title" becomes "1.2.3 Chapter title". Special fixes are applied to TOC (index.htm*) pages. - AllOpenbooksCleaner knows new books visualbasic_2008, einstieg_vb_2010 - CLI (command line interface) * Former book path (subdirectory) is now a book ID * New mandatory parameter download directory (also base directory for book subdirectories) is needed because of the new download feature and must be inserted as the first non-option parameter before the book ID. Changed features - Books are now unpacked into subdirectories with predefined names which might be different from the download archive names. E.g. archive galileocomputing_shell_programmierung.zip will be unpacked to folder shell_prog. Internal changes - Refactor pre-JTidy HTML fixing into its own BasicConverter subclass PreJTidyFilter and move some code there. Currently this filter class only fixes the "Ruby on Rails 2" book because other books doe not need to be pre-filtered just to readable by JTidy. - Reorganise package structure and rename some filter (converter) classes & methods - New class (enum) BookInfo to be used by the download/unpack features - New utility classes FileDownloader (incl. MD5 check), ZipFileExtractor - New class Downloader can download & unpack openbooks Version 0.8 - Initial release - Can convert 30 known openbooks, but has not download/unpack functionality yet - Developed in Java7 - Needs libraries * JTidy r938 * XOM 1.2.8 * TagSoup 1.2.1