Subject: htdig with omega for multiple URLs (websites)
Newsgroups: gmane.comp.search.xapian.general
Date: 2006-03-29 17:41:31 GMT

Olly,

many thanks for suggesting htdig; you saved me a lot of time. You were right: htdig looks better than my original idea, wget.

Using htdig, I can crawl and search a single website, but I need to integrate search of pages spread over 100+ sites. Learning, learning....

Htdig uses a separate document database for every website (one database per URL used to initiate crawling). Htdig can also merge the resulting databases to allow searching the integrated results.

If you still have the script you said you wrote to use htdig as a crawler front-end for omega, I would be really interested to see it.

My htdig crawls a single site. I need to learn how to crawl multiple sites and merge the results. Do you recall whether your htdig2omega script handled this merging? Or did you search a single htdig-crawled database? Or can I merge using htdig and then search using omega?

Thanks for any insight on which way to start looking. Also, if anyone on the list has experience using htdig to crawl multiple websites, I would really appreciate insight or sample scripts.

My current approach would be:
1) generate 100+ config files (one per URL), creating 100+ databases
2) generate a script to merge the results.
Is there a better way?

--
Peter Masiar, Yale Center for Medical Informatics

A: Because it messes up the flow of reading.
Q: Why is top-posting often frowned upon?
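
The two-step approach described above (one config file per start URL, then a generated merge script) could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested htdig setup: the helper names (`site_name`, `config_text`, `write_configs`, `shell_script`), the directory layout, and the `all.conf` target config are hypothetical. The config attributes `database_dir`, `start_url`, and `limit_urls_to`, and the `htdig -i -c` and `htmerge -c ... -m` options, are written as I understand them from the htdig documentation; the exact sequence may differ between htdig versions (a 3.1.x install may need an `htmerge` pass per site before merging), so check the man pages first.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def site_name(url: str) -> str:
    """Turn a start URL into a filesystem-safe name based on its host."""
    host = urlparse(url).netloc or url
    return re.sub(r"[^A-Za-z0-9]+", "_", host).strip("_").lower()


def config_text(url: str, db_root: str) -> str:
    """Return the body of a per-site htdig config (assumed attribute names)."""
    return (
        f"database_dir: {db_root}/{site_name(url)}\n"
        f"start_url: {url}\n"
        "limit_urls_to: ${start_url}\n"  # keep the crawl inside this site
    )


def write_configs(urls, conf_dir, db_root):
    """Write one config file per URL and return the list of paths written."""
    conf_dir = Path(conf_dir)
    conf_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        path = conf_dir / f"{site_name(url)}.conf"
        path.write_text(config_text(url, db_root))
        paths.append(path)
    return paths


def shell_script(conf_paths, merged_conf):
    """Build a shell script: crawl every site, then merge each into one DB.

    merged_conf is a hypothetical, separately written config whose
    database_dir is where the combined databases should live.
    """
    lines = ["#!/bin/sh", "set -e"]
    for path in conf_paths:
        lines.append(f"htdig -i -c {path}")  # initial dig of one site
    for path in conf_paths:
        lines.append(f"htmerge -c {merged_conf} -m {path}")  # merge it in
    return "\n".join(lines) + "\n"
```

For example, `write_configs(["http://www.example.org/"], "conf", "/var/lib/htdig")` would write `conf/www_example_org.conf`, and passing the returned paths to `shell_script(paths, "conf/all.conf")` gives the text of a script to review before running. Generating the script as text, rather than invoking the tools directly, keeps the 100+ commands easy to inspect and rerun.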