{ "metadata": { "name": "", "signature": "sha256:713f0b55c096798e4432e34705429f99ff790619203b5b1d107b7986ec1e49e4" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "These are some notes I'm making while experimenting with scaling the indexer.\n", "\n", "Scaling problems\n", "================\n", "\n", "We ran with 10,000 W/ARCs, but got some troubling timings. Having tweaked the task numbers (5 tasks allowed, with 1GB RAM each; TODO add config details), we have a reasonably fast map phase, taking about two hours to process all 10,000 (implying up to 90 hours to process the whole collection, though bear in mind that a job is only as fast as its slowest tasks, i.e. the big WARC files dominate at smaller job sizes, and there was some competition for cluster time). The first time the JISC 1996-2010 collection was indexed, it only required about a solid day's worth of processing time, i.e. about 26 hours.\n", "\n", "
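\n", "The scaling arithmetic above can be sanity-checked with a quick sketch (the figures are the ones in these notes; the full-collection ARC/WARC counts are listed further down):\n",
"\n",
"```python\n",
"# Back-of-envelope extrapolation of the map phase, using numbers from these notes.\n",
"total_inputs = 442703 + 4494  # ARC + WARC counts for the full collection\n",
"sample_inputs = 10000         # size of the test job\n",
"sample_hours = 2.0            # approximate map time for the test job\n",
"\n",
"estimated_hours = sample_hours * total_inputs / sample_inputs\n",
"print(round(estimated_hours))  # roughly 89, i.e. 'up to 90 hours'\n",
"```\n",
"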
\n", "| # Inputs | # Reducers | Finish Time (Map) | Finish Time (Map+Reduce) | |\n",
"|----------|------------|---------------------|--------------------------|-------|\n",
"| 100 | 10 | 8mins, 41sec | 8mins, 56sec | 4.4.0 |\n",
"| 1,000 | 10 | 31mins, 3sec | 1hrs, 24mins, 19sec | 4.6.1 |\n",
"| 10,000 | 10 | 2hrs, 45mins, 21sec | 21hrs, 25mins, 56sec | 4.4.0 |\n",
"\n",
"However, the 10 reducers are failing. They run twice, the first time crashing out with:\n", "\n", "
\n", "Error: java.io.IOException: No space left on device\n",
"\tat java.io.FileOutputStream.writeBytes(Native Method)\n",
"\tat java.io.FileOutputStream.write(FileOutputStream.java:282)\n",
"\tat org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)\n",
"\tat java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)\n",
"\tat java.io.BufferedOutputStream.write(BufferedOutputStream.java:104)\n",
"\tat org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)\n",
"\tat java.io.DataOutputStream.write(DataOutputStream.java:90)\n",
"\tat org.apache.hadoop.mapred.IFileOutputStream.write(IFileOutputStream.java:84)\n",
"\tat org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)\n",
"\tat java.io.DataOutputStream.write(DataOutputStream.java:90)\n",
"\tat org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:226)\n",
"\tat org.apache.hadoop.mapred.Merger.writeFile(Merger.java:157)\n",
"\tat org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2699)\n",
"\tat org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2640)\n",
"\n", "\n",
"then running OK the second time, presumably because only some of the machines have enough free disk space. [Others have hit this problem](https://issues.apache.org/jira/browse/HADOOP-6092), which indicates that we might be able to sort things out by clearing up the temporary space that Hadoop is configured to use (TODO add config param info). However, it also implies we are likely to hit an upper limit on the size of job we can process, due to the limited amount of temp space we have. Note that this should have nothing to do with HDFS free space, because the system temp space is usually held on a different drive to the DFS volumes.\n", "\n", "
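\n", "For reference, the temp space in question is governed by Hadoop's mapred.local.dir setting (the /mapred/local/dir path that turns up later in these notes), and compressing the intermediate map output should relieve the pressure on it. A sketch of the relevant mapred-site.xml properties, using the Hadoop 1.x-era names that match the stack trace above; the values are illustrative, not our actual settings:\n",
"\n",
"```xml\n",
"<!-- mapred-site.xml (illustrative values, Hadoop 1.x property names) -->\n",
"<property>\n",
"  <!-- Where shuffle spills land; can be a comma-separated list of disks -->\n",
"  <name>mapred.local.dir</name>\n",
"  <value>/mapred/local/dir</value>\n",
"</property>\n",
"<property>\n",
"  <!-- Compress intermediate map output to save local temp space -->\n",
"  <name>mapred.compress.map.output</name>\n",
"  <value>true</value>\n",
"</property>\n",
"<property>\n",
"  <name>mapred.map.output.compression.codec</name>\n",
"  <value>org.apache.hadoop.io.compress.DefaultCodec</value>\n",
"</property>\n",
"```\n",
"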
\n", "| # Inputs | # Reducers | Shuffle Time | Sort Time | Total Reduce Time | |\n",
"|----------|------------|----------------------|---------------|---------------------|-------|\n",
"| 100 | 10 | 5mins, 21sec | 30sec | 8mins, 51sec | |\n",
"| 1,000 | 10 | 49mins, 9sec | 23sec | 1hrs, 18mins, 33sec | 4.6.1 |\n",
"| 10,000 | 10 | 10hrs, 41mins, 35sec | 0sec | 10hrs, 42mins, 9sec | |\n",
"| | | 1hrs, 19mins, 15sec | 43mins, 12sec | 8hrs, 36mins, 27sec | |\n",
"\n",
"Similarly, it's worth noting that even when it works, the shuffle/sort is taking 7-10 hours (which is why it takes over 20 hours when it fails once). Somewhat oddly, every single reducer failed on disk space the first time around, i.e. roughly simultaneously, and then worked (with no clear correlation between the nodes that failed the first time, e.g. being in the same rack). That implies that some kind of temp-space job contention might be the issue.\n",
"\n",
"Note that there are a lot of lines like this:\n", "\n", "
\n", "ERROR WARCIndexerReducer - No appropriate response record found for: sha1:223JBF7A4BH6TNGCAI2MIPGWOPBNJLBB_http://news.bbcimg.co.uk/media/images/48244000/jpg/_48244565_lorenzo_reuters226i.jpg (revisit)\n",
"\n", "\n",
"This is a consequence of the fact that the small sample means the deduplication strategy is failing: some WARCs are mostly revisits.\n",
"\n",
"TODO What is the ARC/WARC composition of the 10,000?\n",
"In the 10,000 sample:\n",
"ARC 5593\n",
"WARC 4407\n",
"\n",
"In the full collection:\n",
"ARC 442703\n",
"WARC 4494\n",
"\n",
"https://issues.apache.org/jira/browse/SOLR-4816 means we are suffering on indexing throughput.\n",
"\n",
"https://wiki.apache.org/solr/SolrCloud\n",
"\n",
"Also, it seems we are putting too much pressure on the sort now. Perhaps this is partly due to the link extraction, and partly due to the higher binary limit allowing more resources to use up more of the 1MB text field size limit.\n",
"\n", "\n",
"Deduplication strategies\n",
"========================\n",
"\n",
"We are using the reduce step as our deduplication strategy. Items with the same URL and content hash are grouped together, and only a single SOLR record is submitted for each group.\n",
"\n",
"To resolve this, we had to properly calculate the hash of the ARCs and allow for multiple crawl dates, and query Solr during the map to decide whether to send an update to the crawl_dates or not.\n",
"\n",
"Rebuilding the indexer\n",
"======================\n",
"\n",
"So, the indexer has been rebuilt.\n",
"\n",
"* Uses new duplicate handling logic.\n",
"* Requests compression of the map output.\n",
"* Face detection, colour extraction.\n",
"* ...\n",
"\n",
"All 'expensive' features are switched on.\n",
"\n",
"Now we need new timings. Started with ten inputs, but a new ten, so the numbers will not be directly comparable.\n",
"\n",
"For 10, Total time: 00:21:52.\n",
"Map: worst case 00:20:00, most around 00:08:00.\n",
"Reduce time: 18 mins, but this includes the long-running shuffle and sort while awaiting the slowest map.\n",
"Actual reduce action time was 30-40 seconds.\n",
"\n",
"On 100, hit problems with empty/malformed payloads that killed the job. Fixing this and re-launching.\n",
"\n",
"We are getting DEBUG output from org.apache.zookeeper and it's not clear why.\n",
"\n",
"Rather slow to warm up and get going. Hopefully this is mapper initialisation and we'll pick up speed shortly. OK, looks like SOLR crashed, and the clients are waiting for it. It was an OOM, but actually 'unable to create new native thread', which is a ulimit thing. Need to up the ulimits for the tomcat user and restart the cluster.\n",
"\n",
"Ok, re-running with SOLR rebooted.\n",
"\n",
"For 100:\n",
"Total time: 00:43:21\n",
"Maps: 6-20 mins per input.\n",
"Reducers: 38 mins overall, but actual submission to Solr only about 2 minutes. (Seems much faster.)\n",
"This would mean 20 weeks! Need timings from 1000 to confirm the reliability of this estimate.\n",
"\n",
"For 1000:\n",
"Some contention, with other indexing jobs running at the same time.\n",
"Now ArchiveCDXGenerator and sorter jobs kicked in, all competing for map time.\n",
"Total time: 15:43:40\n",
"Maps: most 1-2 hrs, worst case was around 11 hours!\n",
"Reduce phase took about two hours, but was overlapping with another job in the reduce phase.\n",
"\n",
"Files taking many hours (>8 hours) to map:\n",
"Processing path: hdfs://nellie-private:54310/ia/PHASE2WARCS/DOTUK-HISTORICAL-1996-2010-PHASE2WARCS-XAAAAZ-20111115000000-000000.warc.gz\n",
"\n",
"Speeding things up\n",
"------------------\n",
"\n",
"To speed things up, we can go to the other extreme and try switching off lots of features.\n",
"\n",
"* Disabling both Image and PDF analysis.\n",
" * Local test: 77.89 seconds -> 45.185 seconds.\n",
"* Disabling PDF analysis.\n",
" * Local test: 77.89 seconds -> 74.764 seconds.\n",
"* Disabling Image analysis.\n",
" * Local test: 77.89 seconds -> 51.093 seconds.\n",
"* Upping the limit on in-memory content processing from 1MB to 10MB:\n",
" * Local test: 81.992 seconds -> 77.89 seconds.\n",
"* Dropping maximum text to extract from 1024K to 1K:\n",
" * Local test: 45.185 seconds -> 42.631 seconds.\n",
"* Dropping maximum bytes to allow Tika to parse from ALL to 1K:\n",
" * Local test: 64.982 seconds -> 53.949 seconds.\n",
"\n",
"So, dropping the image analysis made a large difference. Given the pressures involved right now, it probably makes more sense to disable these features (which are of relatively little interest to the main BUDDAH researchers right now).
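\n", "\n", "Taken together, the local tests above say most of the cost is in image analysis. A quick sketch ranking the toggles by time saved (numbers copied from the list above):\n",
"\n",
"```python\n",
"# Local-test timings (seconds) for each toggle, against the 77.89s baseline.\n",
"baseline = 77.89\n",
"toggles = {\n",
"    'image + PDF analysis off': 45.185,\n",
"    'PDF analysis off': 74.764,\n",
"    'image analysis off': 51.093,\n",
"}\n",
"for name, t in sorted(toggles.items(), key=lambda kv: kv[1]):\n",
"    saved = baseline - t\n",
"    print('%s: saves %.1fs (%.0f%%)' % (name, saved, 100 * saved / baseline))\n",
"```\n",
"\n",
"i.e. image analysis alone accounts for around a third of the run time, consistent with the conclusion above.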
\n", "\n", "Rerunning with image and PDF features turned off.\n",
"\n",
"For 1000:\n",
"Note some contention with the previous job, but mostly in the reducer phase.\n",
"Total time: 10:27:01.\n",
"Mappers better behaved, worst case now 04:20:13.\n",
"Reduce phase approx 3.5 hours, but heavy contention with other reduces from the previous job.\n",
"\n",
"4hr worst-case map:\n",
"hdfs://nellie-private:54310/ia/PHASE2WARCS/DOTUK-HISTORICAL-1996-2010-PHASE2WARCS-XAABLX-20111115000000-000000.warc.gz\n",
"\n",
"Also bad:\n",
"hdfs://nellie-private:54310/ia/PHASE2WARCS/DOTUK-HISTORICAL-1996-2010-PHASE2WARCS-XAAANR-20111115000000-000001.warc.gz\n",
"hdfs://nellie-private:54310/ia/PHASE2WARCS/DOTUK-HISTORICAL-1996-2010-PHASE2WARCS-XAAAAZ-20111115000000-000000.warc.gz\n",
"\n",
"So, setting up a local test with the worst file (XAAAAZ-20111115000000-000000, which is only 0.6GB, so it's probably not raw size that's the problem).\n",
"Hang on, that's a bad idea, as it would probably take 4 hours!\n",
"Adding logging to see where it gets stuck...\n",
"Oh dear. It worked fine.\n",
"\"Finished in 2791.518 seconds.\"\n",
"\n",
"Running again, but excluding most formats from Tika processing (as we usually do): 2749.475 seconds!\n",
"Didn't believe it, so disabled the excludes (i.e. all in) once more: 2018.007 seconds!?\n",
"\n",
"Ok, so cleaning up the code and disabling Tika for problematic types, and rerunning on the 100 with no contention gives...\n",
"Total time: 00:32:07\n",
"Mappers roughly 20-30mins.\n",
"Reducers e.g. 00:01:22, i.e. around a minute.\n",
"\n",
"Trying again with the 1000, although with some contention, and HDFS is extremely full, which may be causing problems...\n",
"Total time:\n",
"\n",
"Other things to try:\n",
"- Disable recursive parsing in Tika.\n",
"- ONLY hash the first X bytes, and use that for dedup.\n",
"\n",
"100, 30 mins?\n",
"\n",
"Okay, so back on the cluster, with the Solr check switched off (relying on updates instead of managing that myself and querying for every resource), and WHOA, that's better. The shuf-100 job runs in about 10mins (instead of 30mins).\n",
"Total time:\n",
"Mappers: 10 mins.\n",
"Reducers: FAILED\n",
"\n",
"Still pretty slow on the reduce. 35 reducers might be hammering it, but still. Maybe need to try cutting down on some fields, e.g. href indexing; possibly even worth ignoring host-level links and just doing domain-level.\n",
"\n",
"Ah, no, my fault. It's the new crawl_years field. Nice work Jackson.\n",
"So, taken that out.\n",
"\n",
"Ran again with 100 inputs. Mapper nice and quick, and 35 reducers coped this time.\n",
"Total time: 00:17:52\n",
"Mappers: c. 8 mins.\n",
"Reducers: c. 10 mins.\n",
"\n",
"Ran with 1000 inputs; mappers quite quick, but reducers kept dying. Dropped the number of reducers to 20, and it seems stable.\n",
"\n",
"Total time: > 01:57:00\n",
"Mappers: c. 30 mins.\n",
"Reducers: lots more time: > 1hr. Hmmm, after about 94.33% of the reduce phase, Solr is locking up. Kill the job and it slowly recovers, which is good.\n",
"\n",
"So, things to try next: switch off link analysis.\n",
"Trying it locally, on the XAAAAZ-20111115000000-000000 test file.\n",
"Finished in 3800.776 seconds.\n",
"Hmmm. Totes inconclusive, due to the variability of timings on the laptop.\n",
"\n",
"So, on the cluster, and without links, for the 1000:\n",
"After ten minutes, 85% done!\n",
"After 15 minutes, 99% done!\n",
"After 20 mins, 99.95% done!\n",
"After 25 mins, 100% done! (Only a couple of WARCs at the end, so more to be gained by running more inputs at once.)\n",
"After c.28 mins, sort is also complete.\n",
"Reducers stuck after nearly eight hours! c.94% complete.\n",
"KILLING\n",
"\n",
"So, trying dropping the reducers to 10. Still dying, but maybe Solr is very grumpy. Restarting Solr.\n",
"So, after 01:51:01 it is done.\n",
"Mappers are quick, indexing still slow: about 30-odd mins for the mappers, about an hour and a quarter indexing.\n",
"\n",
"Oh, links were still switched on!? Trying again without the links... Emptying SOLR...\n",
"\n",
"Locally, switching off multiple features to see how it changes things: no host links, no binary shingling, no 'elements used':\n",
"Finished in 2399.014 seconds.\n",
"Hm. Ok, also dropping the text payload right down to 10K:\n",
"Finished in 2657.414 seconds.\n",
"\n",
"So, back on the cluster, with the links off and an empty target Solr:\n",
"Total time: 01:06:13, i.e. nearly half the time.\n",
"\n",
"Also disabled first_bytes, but left the data in Solr:\n",
"Total time: 01:26:03\n",
"Slightly longer, probably because it was doing updates, not just replacements.\n",
"\n",
"So, cleared the data out, and also dropped the text payload size down to 10KB:\n",
"Total time: 01:03:54\n",
"Nice.\n",
"\n",
"Finally, knocking the number of reducers down to 5, to see if that makes much difference. On an empty Solr.\n",
"Yes, it did run a bit slower, but not massively.\n",
"Total time: 01:27:06\n",
"\n",
"Now running on 10,000, leaving reducers at 5...\n",
"Map time: about two hours!\n",
"Reduce copy kinda slow: reduce > copy (2215 of 10000 at 7.75 MB/s)\n",
"CRASH: with only 5 reducers we run out of disk space...\n",
"\n",
"Upping back to 10 reducers.\n",
"\n",
"Try 15? First, let's try to see why there's so much data.\n",
"\n",
"Dropped the elements_used, in case there was some weirdness there. It was the same (crashed; out of disk space during the shuffle).\n",
"\n",
"Dropping the hosts, in case it's those cheeky link farms that are to blame.\n",
"\n",
"Little difference. Trying dropping the text load.\n",
"\n",
"NOTE: Looked in\n",
"\n",
"    /mapred/local/dir/taskTracker/anjackson/jobcache/job_201402191107_1551/attempt_201402191107_1551_r_000004_0/output\n",
"\n",
"and the output is clearly NOT compressed. And some are BIG, and look like link-farm mess.\n",
"\n",
"Even after reducing the text load to 50K, this uncompressed data still failed on some reducers. Eventually some got through, only to cripple the Solr server (still just 10 reducers). Those that got to Solr failed like this:\n", "\n", "
\n", "2014-03-30 16:29:26 INFO WARCIndexerReducer:111 - Submitted 500 docs [0]\n", "2014-03-30 16:30:30 ERROR WARCIndexerReducer:116 - WARCIndexerReducer.reduce(): No live SolrServers available to handle this request:[http://192.168.1.180:8994/solr/jisc3]\n", "org.apache.solr.client.solrj.impl.CloudSolrServer$RouteException: No live SolrServers available to handle this request:[http://192.168.1.180:8994/solr/jisc3]\n", "\tat org.apache.solr.client.solrj.impl.CloudSolrServer.directUpdate(CloudSolrServer.java:351)\n", "\tat org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:510)\n", "\tat org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)\n", "\tat org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)\n", "\tat org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)\n", "\tat uk.bl.wa.solr.SolrWebServer.add(SolrWebServer.java:106)\n", "\tat uk.bl.wa.hadoop.indexer.WARCIndexerReducer.checkSubmission(WARCIndexerReducer.java:110)\n", "\tat uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:84)\n", "\tat uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:29)\n", "\tat org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:469)\n", "\tat org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)\n", "\tat org.apache.hadoop.mapred.Child$4.run(Child.java:270)\n", "\tat java.security.AccessController.doPrivileged(Native Method)\n", "\tat javax.security.auth.Subject.doAs(Subject.java:396)\n", "\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)\n", "\tat org.apache.hadoop.mapred.Child.main(Child.java:264)\n", "Caused by: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://192.168.1.180:8994/solr/jisc3]\n", "\tat org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:354)\n", "\tat 
org.apache.solr.client.solrj.impl.CloudSolrServer$1.call(CloudSolrServer.java:332)\n", "\tat org.apache.solr.client.solrj.impl.CloudSolrServer$1.call(CloudSolrServer.java:329)\n", "\tat java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)\n", "\tat java.util.concurrent.FutureTask.run(FutureTask.java:138)\n", "\tat java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)\n", "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)\n", "\tat java.lang.Thread.run(Thread.java:662)\n", "Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Cannot talk to ZooKeeper - Updates are disabled.\n", "\tat org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)\n", "\tat org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)\n", "\tat org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:285)\n", "\t... 7 more\n", "2014-03-30 16:39:49 ERROR WARCIndexerReducer:116 - WARCIndexerReducer.reduce(): No live SolrServers available to handle this request:[http://192.168.1.180:8988/solr/jisc3, http://192.168.1.180:8983/solr/jisc3, http://192.168.1.180:8996/solr/jisc3, http://192.168.1.180:8994/solr/jisc3]\n", "org.apache.solr.client.solrj.impl.CloudSolrServer$RouteException: No live SolrServers available to handle this request:[http://192.168.1.180:8988/solr/jisc3, http://192.168.1.180:8983/solr/jisc3, http://192.168.1.180:8996/solr/jisc3, http://192.168.1.180:8994/solr/jisc3]\n", "\tat org.apache.solr.client.solrj.impl.CloudSolrServer.directUpdate(CloudSolrServer.java:351)\n", "\tat org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:510)\n", "\tat org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)\n", "\tat org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)\n", "\tat 
org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)\n", "\tat uk.bl.wa.solr.SolrWebServer.add(SolrWebServer.java:106)\n", "\tat uk.bl.wa.hadoop.indexer.WARCIndexerReducer.checkSubmission(WARCIndexerReducer.java:110)\n", "\tat uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:84)\n", "\tat uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:29)\n", "\tat org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:469)\n", "\tat org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)\n", "\tat org.apache.hadoop.mapred.Child$4.run(Child.java:270)\n", "\tat java.security.AccessController.doPrivileged(Native Method)\n", "\tat javax.security.auth.Subject.doAs(Subject.java:396)\n", "\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)\n", "\tat org.apache.hadoop.mapred.Child.main(Child.java:264)\n", "Caused by: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://192.168.1.180:8988/solr/jisc3, http://192.168.1.180:8983/solr/jisc3, http://192.168.1.180:8996/solr/jisc3, http://192.168.1.180:8994/solr/jisc3]\n", "\tat org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:354)\n", "\tat org.apache.solr.client.solrj.impl.CloudSolrServer$1.call(CloudSolrServer.java:332)\n", "\tat org.apache.solr.client.solrj.impl.CloudSolrServer$1.call(CloudSolrServer.java:329)\n", "\tat java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)\n", "\tat java.util.concurrent.FutureTask.run(FutureTask.java:138)\n", "\tat java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)\n", "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)\n", "\tat java.lang.Thread.run(Thread.java:662)\n", "Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Cannot talk to ZooKeeper - Updates are disabled.\n", "\tat 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)\n",
"\tat org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)\n",
"\tat org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:285)\n",
"\t... 7 more\n",
"\n", "\n",
"So, it seems the issue is simply one of scale. The JISC collection has very large numbers of resources per input file, and this stretches the size of the mapper outputs to the system's limits.\n",
"\n",
"Trying one more run, switching on the PDF checker and enabling compression, to see how it goes... That seemed to work! Output/temp files look compressed. Also, the PDF Preflight tests appeared to add a negligible amount of processing (all done at 02:18:00). Shuffle & Merge seemed to work ok, although with some disk space grumbling. Sadly, the SOLRs are still down, but a 5min pause should help, I think. Restarting SOLRs... So, it worked a bit, but killed the SOLRs pretty quickly.\n",
"After 07:30:00 the merge was complete (although many finished before then - using the host as the key is not very balanced).\n",
"BUT after 15:00:00, still locked with only some successful submissions. A couple of reducers got over 69%.\n",
"\n",
"Just realised all the reducers are running on the same nodes! That's not helping... Num reducers reduced to one?\n",
"\n",
"To confirm this, I'd like to understand what happened for the earlier indexing processes. For AADDA, I assume the difference is largely down to the links. For LDWA, it seems unclear, as I recall Roger processing that in one go!\n",
"\n",
"Ok, so confirmed with Roger that he did it in c.34 chunks of about a thousand WARCs each (33,102 in total). Given there are 1.1 billion items in the index, this still seems to be rather good performance compared to what we see now. Perhaps the number of reducers per node was lower? Maybe only two, at 2GB each.\n",
"\n",
"Rough history:\n",
"\n",
"First, 1GB for both, and up to 8 of either mappers or reducers.\n",
"Then, 2GB and 2/2 mappers/reducers.\n",
"Currently, 1GB and 5/5 mappers/reducers.\n",
"\n",
"The timing is right for the LDWA index to have been built during the 2-reducers-per-node period, which may well explain why that worked ok.\n",
"\n",
"Indexing directly onto HDFS\n",
"===========================\n",
"The second JISC2 index appears to have failed, so a major change of tack is required.\n",
"\n",
"Indexing Old News\n",
"-----------------\n",
"With Lewis, I chopped up some Cloudera code and managed to build multiple shards directly on HDFS during the Reduce phase.\n",
"\n",
"Working through how to index locally during MapReduce.\n",
"Using JISC2 as test data.\n",
"Indexed 9, got 63 pages, including \"lsidyv10a49/p10\".\n",
"Next, 100 items from the tail, and then the 9 from the top, to check the old ones are not overwritten/lost.\n",
"\n",
"So, in 3mins, 100 issues pulled down and indexed. 622 distinct pages.\n",
"Now attempting 10,000! 57 minutes! i.e. the whole thing in a day!\n",
"One Mapper took three times longer, which is a bit weird.\n",
"Ran another 10,000, and 57 mins again! With one slow mapper! Perhaps some of the nodes are smaller and slower than the others.\n",
"Running with 50,000; the mappers are nice and fast, so it should be c. 5 hrs for a linear speedup.\n",
"So, reducers occasionally failed, with only 1GiB of RAM. Upped to 1.5GiB.\n",
"\n",
"But eventually (10 hrs) they all ran, leaving an index with 455,122 pages in it.\n",
"Rerunning with increased RAM and empty indexes, to check it's all ok.\n",
"4hrs, 37mins, 49sec, all good.\n",
"Should be shorter, as most of the reducers took 30 mins, but one node is slow (openstack8) at 1.5hrs.\n",
"\n",
"hdfs://openstack2.ad.bl.uk:8020/user/anjackson/newindex1\n",
"\n",
"Using the new output format should be neater, but it depends on Hadoop 2.x.x, which will no doubt cause PAIN.\n",
"I can use the hacked-together logic easily enough.\n",
"\n", "
\n", "2014-04-09 16:38:22,343 INFO org.apache.solr.core.CoreContainer: registering core: core1\n", "2014-04-09 16:38:22,343 INFO org.apache.solr.core.SolrCore: QuerySenderListener sending requests to Searcher@5a90f357 main{StandardDirectoryReader(segments_q:2503 _ti(4.4):C49232 _mi(4.4):C2160 _mq(4.4):C127 _v7(4.4):C5538 _pi(4.4):C100 _xg(4.4):C5570 _qz(4.4):C86 _yk(4.4):C430 _ta(4.4):C90 _v1(4.4):C115 _vh(4.4):C2269 _vz(4.4):C87 _wx(4.4):C92 _wz(4.4):C15 _x6(4.4):C157 _xr(4.4):C596 _y0(4.4):C533 _ya(4.4):C553 _xq(4.4):C1960 _xs(4.4):C569 _xt(4.4):C228 _xu(4.4):C135 _yd(4.4):C24 _ye(4.4):C8 _yj(4.4):C56 _yl(4.4):C55 _ym(4.4):C65 _yn(4.4):C56 _yo(4.4):C48 _yp(4.4):C67 _yq(4.4):C21)}\n", "2014-04-09 16:38:23,447 INFO org.apache.solr.update.LoggingInfoStream: [IFD][main]: init: current segments file is \"segments_q\"; deletionPolicy=org.apache.solr.core.IndexDeletionPolicyWrapper@75bb31b9\n", "2014-04-09 16:38:23,500 INFO org.apache.solr.update.LoggingInfoStream: [IFD][main]: init: load commit \"segments_n\"\n", "2014-04-09 16:38:24,995 INFO org.apache.solr.update.LoggingInfoStream: [IFD][main]: init: load commit \"segments_q\"\n", "2014-04-09 16:38:25,145 INFO org.apache.solr.core.SolrCore: [core1] webapp=null path=null params={event=firstSearcher&q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false} hits=69904 status=0 QTime=2801 \n", "2014-04-09 16:38:25,163 INFO org.apache.solr.core.SolrCore: SolrDeletionPolicy.onInit: commits: num=2\n", "\tcommit{dir=NRTCachingDirectory(org.apache.solr.store.hdfs.HdfsDirectory@772ce69f lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory@7348fb70; maxCacheMB=192.0 maxMergeSizeMB=16.0),segFN=segments_n,generation=23}\n", "\tcommit{dir=NRTCachingDirectory(org.apache.solr.store.hdfs.HdfsDirectory@772ce69f lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory@7348fb70; maxCacheMB=192.0 maxMergeSizeMB=16.0),segFN=segments_q,generation=26}\n", "2014-04-09 16:38:25,166 INFO org.apache.solr.core.SolrCore: newest commit 
generation = 26\n",
"\n", "\n",
"Ok, so old commits (for syncing) are being dropped, which is fine. All results should be there.\n",
"\n",
"Okay, took the (19GiB!) impalad off openstack8 (which appears to be running a desktop) and it runs quicker.\n",
"\n",
"Futzing with the Solr JISC2 Newspapers.\n",
"\n",
"shard3: 273,160 docs\n",
"\n",
"Okay, made a new core, a replica of shard3, and swapped over the folders.\n",
"https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201312.mbox/%3Ca793e6444ceb454694f78e027e2fcb3f@BLUPR06MB417.namprd06.prod.outlook.com%3E\n",
"Filled out all the fields, including HDFS URLs copied and modified from the source node.\n",
"Then swapped the folders. Still 1,094,330 documents.\n",
"\n",
"Shards 0-3 are actually in the wrong places, but it still worked!\n",
"\n",
"shard1 refers to /solr/jisc2/core_node3 (has shard3)\n",
"shard2 refers to /solr/jisc2/core_node2 (has shard2)\n",
"shard3 was in node4, now /solr/jisc2/core_node5 (has shard4)\n",
"shard4 refers to /solr/jisc2/core_node1 (has shard1)\n",
"\n",
"So:\n",
"core_node1:shard1 needs to move to shard1:core_node3\n",
"core_node2:shard2 is fine\n",
"core_node3:shard3 needs to move to shard3:core_node5\n",
"core_node5:shard4 needs to move to shard4:core_node1\n",
"\n",
"Doesn't seem to matter until you add documents!\n",
"\n",
"http://openstack9.ad.bl.uk:8983/solr/#/jisc2_shard1_replica1/query\n",
"\n",
"Note these are set up much as per: https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS\n",
"\n",
"### Notes sent to Lewis ###\n",
"Once I\u2019d worked out how to index onto HDFS and avoid openstack8 getting bogged down (I think it\u2019s running a desktop session too?), I managed to get the indexing time down to 3 hours per 50,000 newspapers \u2013 i.e. the total run of 192,349 newspapers took about 12 hours to index. In fact, I think this is a significant overestimate, as openstack8 was still running somewhat slower than the others, and because there seem to be duplicate rows in the table (see below), implying that around a third of the data was indexed twice.\n",
" \n",
"I used four mappers to pull down the content, and then distributed the results to four reducers that each built a single Solr core. The four simultaneous clients downloading the OCR XML files did not cause DLS any issues, and grabbing them over HTTP did not seem to be a significant bottleneck at this scale. Each XML file was downloaded, and a distinct Solr record was created for each page (this is somewhat arbitrary \u2013 it could be by article or issue instead). The final index contains 1,094,330 distinct pages, but note that pages with no text were discarded (which was probably a mistake in retrospect, as knowing how many pages have no text might be useful/interesting).\n",
" \n",
"You can see the results via: http://openstack9.ad.bl.uk:8983/solr/#/jisc2_shard1_replica1/query\n",
" \n",
"And construct queries like this:\n",
" \n",
"* To see the pages that contain the term \u201cBritish Museum\u201d, sorted by date: http://openstack9.ad.bl.uk:8983/solr/jisc2_shard1_replica1/select?q=%22British%20Museum%22&sort=year_s+asc&rows=20&fl=originalname_s%2C+page_i&wt=xml&indent=true&hl=true&hl.fl=content&hl.fragsize=200&hl.simple.pre=*&hl.simple.post=*\n",
"* To see the distribution of all pages across the years: http://openstack9.ad.bl.uk:8983/solr/jisc2_shard1_replica1/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=year_s or as XML http://openstack7.ad.bl.uk:8983/solr/jisc2_shard2_replica1/select?q=*%3A*&rows=0&wt=xml&indent=true&facet=true&facet.field=year_s\n",
"* Mentions of \u201cA study in scarlet\u201d, sorted by time, showing fragments for context: http://openstack9.ad.bl.uk:8983/solr/jisc2_shard1_replica1/select?q=%22A+study+in+scarlet%22&sort=year_s+asc&rows=20&fl=originalname_s%2C+page_i&wt=xml&indent=true&hl=true&hl.fl=content&hl.fragsize=200&hl.simple.pre=*&hl.simple.post=*\n",
"* Mentions of \u201cSherlock holmes\u201d, as above, but as a short-range proximity search (i.e. up to one word apart): http://openstack9.ad.bl.uk:8983/solr/jisc2_shard1_replica1/select?q=%22sherlock%20holmes%22~1&sort=year_s+asc&rows=20&fl=originalname_s%2C+page_i&wt=xml&indent=true&hl=true&hl.fl=content&hl.fragsize=200&hl.simple.pre=*&hl.simple.post=*\n",
" \n",
"Solr also provides an interface for summary statistics, here: http://openstack9.ad.bl.uk:8983/solr/#/jisc2_shard1_replica1/schema-browser\n",
"e.g. you can select the field \u2018simpletitle_s\u2019 and then hit \u2018Load Term Info\u2019 to see the distribution of titles (we have 39, and the largest chunk of content is from the Morning Post). Similarly, you can select \u2018originalname_s\u2019 and see there are 146,110 distinct XML filenames, in contrast to the 192,349 lines from the HIVE table, which would appear to imply there are duplicate lines in the database.\n",
" \n",
"I\u2019ve not moved to use openstack6 instead of 8 yet, as I\u2019m not sure how to do this cleanly without risking breaking what we have.\n",
"\n",
"Back to WARC\n",
"------------\n",
"\n",
" Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FileAlreadyExistsException\n",
"\t at org.apache.solr.store.hdfs.HdfsLockFactory.makeLock(HdfsLockFactory.java:53)\n",
"\t at org.apache.lucene.store.BaseDirectory.makeLock(BaseDirectory.java:41)\n",
"\t at org.apache.solr.store.blockcache.BlockDirectory.makeLock(BlockDirectory.java:283)\n",
"\t at org.apache.lucene.store.NRTCachingDirectory.makeLock(NRTCachingDirectory.java:109)\n",
"\t at org.apache.lucene.index.IndexWriter.
\n", " Apr 14, 2014 5:09:25 PM org.apache.catalina.startup.Catalina start\n", " INFO: Server startup in 319574 ms\n", " 2014-04-14 17:09:25.504; [recoveryExecutor-6-thread-1] WARN org.apache.solr.update.UpdateLog \u2013 Starting log replay tlog{file=/opt/data/solrnode3/ldwadev/data/tlog/tlog.0000000000000000254 refcount=2} active=false starting pos=0\n", " 2014-04-14 17:09:27.876; [recoveryExecutor-6-thread-1] ERROR org.apache.solr.update.UpdateLog \u2013 java.io.EOFException\n", " at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:154)\n", " at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:559)\n", " at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:180)\n", " at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:477)\n", " at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)\n", " at org.apache.solr.common.util.JavaBinCodec.readSolrInputDocument(JavaBinCodec.java:393)\n", " at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:229)\n", " at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:477)\n", " at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)\n", " at org.apache.solr.update.TransactionLog$LogReader.next(TransactionLog.java:630)\n", " at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1272)\n", " at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1215)\n", " at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)\n", " at java.util.concurrent.FutureTask.run(FutureTask.java:262)\n", " at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)\n", " at java.util.concurrent.FutureTask.run(FutureTask.java:262)\n", " at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n", " at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n", " at 
java.lang.Thread.run(Thread.java:744)\n", "\n", "\n", "Interesting. Fortunately it doesn't seem to have significantly damaged the data (the transaction log is only really a 'backup', to be used in case the index itself is not shut down correctly), as we have 6.1 million URLs, which is roughly what I'd expect from 1/40th of 2.5 billion items.\n", "\n", "As a precaution, I've modified the code so that the reducer waits while the data is committed to disk. I'd assumed that was the default, but in fact the embedded Solr server we use here runs most tasks on background threads. The situation we've seen is consistent with that background thread having been forcefully killed, so adding a blocking 'wait' should hopefully resolve it.\n", "\n", "Fixed up the tests again and enabled solr.lock.type configuration; Roger helped resolve a classpath issue (hadoop-core should be 'provided', c.f. http://answers.mapr.com/questions/4811/numberformatexception-setting-up-job). Now running on a random sample of 20,000 input files, which we can expect to take 20-25 hours.\n", "\n", "So, running 5 jobs on grunt22; some are hanging. One was very unhappy. 
It had this message earlier on:\n", "\n", " Exception in thread \"Lucene Merge Thread #17\" org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException\n", " at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)\n", " at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)\n", " Caused by: java.lang.NullPointerException\n", " at org.apache.lucene.util.packed.MonotonicAppendingLongBuffer.get(MonotonicAppendingLongBuffer.java:75)\n", " at org.apache.lucene.util.packed.AbstractAppendingLongBuffer.get(AbstractAppendingLongBuffer.java:101)\n", " at org.apache.lucene.index.MultiDocValues$OrdinalMap.getGlobalOrd(MultiDocValues.java:390)\n", " at org.apache.lucene.codecs.DocValuesConsumer$7$1.setNext(DocValuesConsumer.java:610)\n", " at org.apache.lucene.codecs.DocValuesConsumer$7$1.hasNext(DocValuesConsumer.java:558)\n", " at org.apache.lucene.codecs.lucene45.Lucene45DocValuesConsumer.addNumericField(Lucene45DocValuesConsumer.java:141)\n", " at org.apache.lucene.codecs.lucene45.Lucene45DocValuesConsumer.addSortedSetField(Lucene45DocValuesConsumer.java:414)\n", " at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addSortedSetField(PerFieldDocValuesFormat.java:121)\n", " at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedSetField(DocValuesConsumer.java:441)\n", " at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:207)\n", " at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:116)\n", " at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4146)\n", " at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3743)\n", " at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)\n", " at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)\n", " Exception in thread \"Lucene Merge Thread 
#18\" org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException\n", " at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)\n", " at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)\n", " Caused by: java.lang.NullPointerException\n", " Exception in thread \"Lucene Merge Thread #19\" org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException\n", " at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)\n", " at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)\n", " Caused by: java.lang.NullPointerException\n", " \n", "and then somewhat later, lots of\n", "\n", " 2014-04-16 09:46:17 INFO UpdateHandler:540 - start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}\n", " 2014-04-16 09:46:17 ERROR CommitTracker:120 - auto commit error...:org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual header=1701604449 vs expected header=1071082519 (resource: _mq_Lucene41_0.tip)\n", " at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:128)\n", " at org.apache.lucene.util.fst.FST.
\n", "Error: GC overhead limit exceeded\n", "Error: Java heap space\n", "2014-04-25 02:41:45 FATAL Child:318 - Error running child : java.lang.OutOfMemoryError: Java heap space\n", "\tat java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)\n", "\tat java.lang.StringCoding.encode(StringCoding.java:272)\n", "\tat java.lang.String.getBytes(String.java:946)\n", "\tat uk.bl.wa.solr.TikaExtractor.extract(TikaExtractor.java:241)\n", "\tat uk.bl.wa.analyser.payload.WARCPayloadAnalysers.analyse(WARCPayloadAnalysers.java:107)\n", "\tat uk.bl.wa.indexer.WARCIndexer.extract(WARCIndexer.java:449)\n", "\tat uk.bl.wa.indexer.WARCIndexer.extract(WARCIndexer.java:220)\n", "\tat uk.bl.wa.hadoop.indexer.WARCIndexerMapper.map(WARCIndexerMapper.java:91)\n", "\tat uk.bl.wa.hadoop.indexer.WARCIndexerMapper.map(WARCIndexerMapper.java:28)\n", "\tat org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)\n", "\tat org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)\n", "\tat org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)\n", "\tat org.apache.hadoop.mapred.Child$4.run(Child.java:270)\n", "\tat java.security.AccessController.doPrivileged(Native Method)\n", "\tat javax.security.auth.Subject.doAs(Subject.java:396)\n", "\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)\n", "\tat org.apache.hadoop.mapred.Child.main(Child.java:264)\n", "Task attempt_201404161414_0096_m_020825_2 failed to report status for 20000 seconds. 
Killing!\n", "Error: GC overhead limit exceeded\n", "2014-04-25 02:53:57 FATAL Child:318 - Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded\n", "\tat org.apache.tika.language.ProfilingWriter.addLetter(ProfilingWriter.java:82)\n", "\tat org.apache.tika.language.ProfilingWriter.addSeparator(ProfilingWriter.java:87)\n", "\tat org.apache.tika.language.ProfilingWriter.write(ProfilingWriter.java:72)\n", "\tat org.apache.tika.language.LanguageProfile.<init>(LanguageProfile.java:67)\n", "\tat org.apache.tika.language.LanguageProfile.<init>(LanguageProfile.java:71)\n", "\tat org.apache.tika.language.LanguageIdentifier.<init>(LanguageIdentifier.java:133)\n", "\tat uk.bl.wa.extract.LanguageDetector.detectLanguage(LanguageDetector.java:102)\n", "\tat uk.bl.wa.analyser.text.LanguageAnalyser.analyse(LanguageAnalyser.java:54)\n", "\tat uk.bl.wa.analyser.text.TextAnalysers.analyse(TextAnalysers.java:74)\n", "\tat uk.bl.wa.indexer.WARCIndexer.extract(WARCIndexer.java:461)\n", "\tat uk.bl.wa.indexer.WARCIndexer.extract(WARCIndexer.java:220)\n", "\tat uk.bl.wa.hadoop.indexer.WARCIndexerMapper.map(WARCIndexerMapper.java:91)\n", "\tat uk.bl.wa.hadoop.indexer.WARCIndexerMapper.map(WARCIndexerMapper.java:28)\n", "\tat org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)\n", "\tat org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)\n", "\tat org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)\n", "\tat org.apache.hadoop.mapred.Child$4.run(Child.java:270)\n", "\tat java.security.AccessController.doPrivileged(Native Method)\n", "\tat javax.security.auth.Subject.doAs(Subject.java:396)\n", "\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)\n", "\tat org.apache.hadoop.mapred.Child.main(Child.java:264)\n", "\n", "\n", "OKAY, so modified to catch OOME and move on instead of failing totally. Re-running. Got past the mappers ok. 
9 hours of sorting.\n", "\n", "At 26hrs, 38mins, 50sec, we are at 34,571,153 of 275,918,077 records in the reducers.\n", "\n", "In the end, Total Time: 233hrs, 51mins, 33sec. Of 275,917,465 records, 72 errors caused 612 records to be dropped. These errors almost exclusively occurred during the first few hours of the index build.\n", "\n", "Okay, upped to 1000 instead of 100 per submission (it was 500 before, when it seemed to work ok), and launching on the second chunk.\n", "\n", "Total: 41hrs, 34mins, 47sec\n", "Currently 29,562,045 of 275,034,327 ingested.\n", "Started at Tue May 06 15:35:33 BST 2014\n", "Sort finished at 7-May-2014 03:12:13\n", "i.e. c. 30 hours for one tenth of this tenth.\n", "Definitely appears much slower than when processing smaller chunks.\n", "KILLING and reverting to smaller chunks - 22,500 per job.\n", "\n", "So, now:\n", "7hrs, 51mins, 21sec, at 2,517,002 of 137,026,829 records.\n", " \n", "GAH.\n", "\n", "Going back to the task tracker, 127,000,000 records indexed in 21 hours!\n", "\n", "From http://194.66.232.87:50030/jobdetailshistory.jsp?logFile=file%3A%2Fusr%2Flib%2Fhadoop-0.20%2Flogs%2Fhistory%2Fdone%2Fbellie-private_1397654045202_job_201404161414_0020_anjackson_..%252Fia.files.shuf.split.aa.nofails_1397773215102\n", "\n", "Submitted At: 17-Apr-2014 23:20:57\n", "Launched At: 17-Apr-2014 23:21:01 (3sec)\n", "Finished At: 18-Apr-2014 21:02:19 (21hrs, 41mins, 18sec)\n", "\n", "So, git diffing against that time. Minor code changes, some config changes, but many Solr changes.\n", "\n", "    git diff 'HEAD@{17-Apr-2014 23:20:00}' HEAD .\n", " \n", "Changes are since commit [e3189e0](https://github.com/ukwa/webarchive-discovery/commit/e3189e068838911adac9dd662e84f422f042c223). \n", " \n", "Also, is the Cloud config ok? Which config set does the implementation depend on? It caches the config from the server and puts it in the distributed cache.\n", "\n", "OK, so the new solrconfig.xml has this line:\n", "\n", " \n", "