| * ndushay leaves | 03:01 | |
| * bess joins | 09:20 | |
| * bess leaves | 09:55 | |
| * bess joins | 11:50 | |
| * bess leaves | 12:32 | |
| * ndushay joins | ||
| * ndushay leaves | 13:01 | |
| * ndushay joins | 13:25 | |
| * bess joins | 14:35 | |
| <erikhatcher> | anyone here that can walk me through getting SolrMarc up and running against some sample MARC data? | 15:03 |
| <bess> | erikhatcher: I'm trying to see if bob haschart is around and I'll ask him to help you | 15:04 |
| * Naomi joins | 15:16 | |
| * ndushay leaves | 15:34 | |
| <jrochkind> | erikhatcher: I can try to help, although I'm not an expert. it's pretty straightforward. | 15:37 |
| But I might be muddling around a bit. | ||
| <erikhatcher> | jrochkind: i'm working with Bob now | |
| <jrochkind> | Sweet, Bob is obviously the expert. | |
| Anyone around that understands the Jetty/Solr that used to be packaged with demo, and I guess now is optionally installable with the template? | 15:38 | |
| apparently not. | 15:40 | |
| <erikhatcher> | all looks well with what you emailed, jrochkind | 15:48 |
| when you said it had Solr JARS in there, i didn't think you meant "plugins" | ||
| <jrochkind> | what do you mean? I am confused. | |
| Aha. | ||
| <erikhatcher> | but those are all additional plugins | |
| <jrochkind> | Yeah, I didn't know what they were. | 15:49 |
| <erikhatcher> | solr-cell is the PDF/Word indexer stuff | |
| <jrochkind> | Aha, "solr-cell" is different than solr. I get it. | |
| So I've got a buncha plugins. Which wouldn't come with 'stock' solr? Or is there no such thing as 'stock' solr? | ||
| <erikhatcher> | it comes with stock solr actually, yes | |
| <jrochkind> | OH, okay, cool. | 15:50 |
| <erikhatcher> | except for the UnicodeNormalizeFilter i think | |
| that's bob magic | ||
| <jrochkind> | So now I'm back to my original question... I feel weird about putting my own weird customish code mixed in the same dir with 'stock solr' stuff. Is there a better way to do it, or should I just do it and not worry about it? | |
| <erikhatcher> | again, you can use solrconfig.xml to point to external JARs/dirs | |
| or use solr-home/lib | ||
| your call | ||
| or repackage the solr.war file with your JAR there :) | 15:51 | |
| <jrochkind> | Cool, cool. But it's ordinary to just put your own custom stuff in solr-home/lib, that already has 'stock' plugins in it too? | |
| That last one is the one I know I will NOT be doing. :) | ||
| I'm trying to keep my stuff _separate_ from stock solr, not mix it in even more confusingly. :) | ||
| <erikhatcher> | i personally would recommend in solr-home/lib to keep it all together and simple | 15:52 |
| <jrochkind> | Then I'll go with that, that's what I was looking for, just advice from someone that's been there before. | |
| <erikhatcher> | but a <lib> setting in solrconfig (see a Solr 1.4 example solrconfig.xml) to point externally is fine too | |
| again, your call | ||
| <jrochkind> | Okay, sweet. | 15:53 |
| But $solr_home/lib DOES often have 'stock' plugins in it? And people DO often just stick their own custom plugins in there anyhow? | ||
| But yeah, I guess I do like the idea of keeping it separate. $solr_home/local-lib or what have you. Okay, groovy. erikhatcher++ | ||
| <erikhatcher> | "often" is relative... depends on the needs of the project. | 15:55 |
| Solr's *example* ships with it populated | ||
| but that is a kitchen sink | ||
| <jrochkind> | aha, I get it. | |
| thanks, I'm good. I'm totally just gonna make a local_lib dir, why not. | ||
| <erikhatcher> | if you aren't indexing PDF/Word/HTML files and such, you can get rid of most of it | 15:56 |
| <jrochkind> | yeah. But I'm going to have to go through a process of elimination to figure out what I can actually get rid of, since I'm using solrconfig/schema files "inherited" from someone else, I'm not exactly sure what it uses. | |
| <BillDueber> | jrochkind: I just benchmarked fetching the full marc record vs not. I'm not seeing a measurable difference on our hardware. | 16:47 |
| jrochkind: Hold that thought. I misspelled the field name. It wasn't fetching it. | 16:49 | |
| <jrochkind> | BillDueber: I saw that. That's awesome. BUT. How hard would it be to do with a million records too? | |
| <BillDueber> | jrochkind: Nah. Doing it right now. It's still just as fast. | |
| <jrochkind> | BillDueber: Some have suggested that if stored marc IS going to cause a problem, it's not going to appear until the index is sufficiently large. | |
| <BillDueber> | jrochkind: Well, I've got 6M records. | |
| <jrochkind> | if you wanted to try it on a 1M-6M index, it might be more convincing. | 16:50 |
| <BillDueber> | jrochkind: I've got a 6M record index. I'm just sampling it. | |
| <jrochkind> | With a small index, as far as I can tell, the slowdown I'm seeing in the Rails app is purely in the HTTP fetch itself, it's inside Net::HTTP.get. Not sure what to do about that. | |
| BillDueber: Ah, the index is 6M, you're just only pulling 75K out of it? Okay, you're golden. Hooray. | ||
| <erikhatcher> | BillDueber: by just as fast, are you looking at Solr's QTime? | 16:53 |
| <BillDueber> | No, looking at jmeter throughput. But I'm suddenly not convinced I'm doing what I think I'm doing. | |
| <jrochkind> | yeah, I'd expect it to slow down at least some tiny but noticeable amount just based on the increased size of the HTTP responses generated. | 16:55 |
| <erikhatcher> | there's no way it is just as fast. pulling stored fields from a lucene index is definitely a bottleneck | 16:56 |
| <BillDueber> | erikhatcher: Yeah. I fucked up. | 16:57 |
| jrochkind, erikhatcher: OK. It's about 3/5 as fast -- 1600 requests/second | 16:58 | |
| <jrochkind> | good to know, good to know. | 16:59 |
| Alternately storing your Big Data in some external store and fetching it is going to cost time too. I'm not sure if that hits the Special Lucene Big Data penalty or not, 3/5ths doesn't seem THAT bad. | 17:00 | |
| <BillDueber> | jrochkind: Right. And in absolute numbers -- 1600/second (or 1000/second for fl=*) is still ridiculously fast compared to the rest of the app. | 17:01 |
| <erikhatcher> | a lot can depend on what other fields are stored too, how the seeking goes on disk and such | |
| BillDueber: also keep in mind that processing that chunk takes time that the user feels | 17:02 | |
| <jrochkind> | of course. BillDueber, I figure this was in an index configured for VuFind? So more or less a 'typical library OPAC type setup'. | |
| <BillDueber> | jrochkind: Right | |
| erikhatcher: Right, but it adds huge amount of flexibility. | 17:03 | |
| <jrochkind> | I mean, the more you do, the more time it's going to take, no matter what. Sure, if you can avoid having to deal with marc at all, it's gonna be faster. The less you do the faster things will be, but there's a function per fastness tradeoff. And then you optimize your bottlenecks. | |
| <erikhatcher> | BillDueber: don't get me wrong...i dig stored MARC. that was in my first Flare (pre Blacklight) implementation | 17:04 |
| <BillDueber> | I'm guessing the real bottleneck in working with the raw marc is parsing the marc. And in that case, I can always pre-parse it and store json. | |
| <jrochkind> | So the question was/is, if you DO want marc at results list display time, is there a _special_ lucene performance penalty to getting it out of a stored field, meaning that's a bottleneck that should be optimized by storing it elsewhere. | |
| It still sounds to me like we lack actual experience/evidence that this is so. | ||
| <erikhatcher> | marc_display field was ruby-marc serialized. i loved it! | |
| <jrochkind> | BillDueber: If you're storing marc as binary, which I am, I think that should be a lot faster to parse than XML. I don't think it should be a problem. | |
| <erikhatcher> | yeah, just gonna have to profile it. it's gonna depend on lots of things | |
| <BillDueber> | jrochkind: Maybe. I'm not saying parsing marc can't be fast -- just that it's slower than getting the marc data out of solr. | 17:05 |
| <jrochkind> | Yep, profile and optimize the bottlenecks. But I'm thinking that at this point there's no reason to predict the solr/lucene stored field is gonna be the bottleneck to attend to. | |
| <erikhatcher> | saying stored fields are "slow" in Lucene is only saying slow*er* than just pure search | |
| hasn't it been shown to be a bottleneck for UVa and Stanford? | ||
| <BillDueber> | erikhatcher: It's been asserted... | 17:06 |
| <jrochkind> | I am skeptical that's been shown. | |
| <erikhatcher> | but MARC isn't all that big (compared to what many do with Solr and large docs) | |
| anyway, i'm all for stored marc until it is shown to be the problem | ||
| <BillDueber> | UVA was storing marc-xml and using ruby-marc with REXML to parse it. Which is *ridiculously* slow. | 17:07 |
| <erikhatcher> | ewww, rexml-- | |
| <BillDueber> | With the new libxml or jstax parsers, that bottleneck is less of an issue. | |
| <erikhatcher> | i was using ruby-marc serialized, not XML | |
| was plenty fast in my initial work | ||
| <BillDueber> | erikhatcher: Which is smart. I've also advocated using a marc-json format, because we've got a few languages floating around here that might want to hit our solr install. | |
| <erikhatcher> | sounds like a good idea | 17:08 |
| <jrochkind> | i don't think anyone _profiled_. they just benchmarked. so it's a guess as to WHERE the bottleneck was, and if it's still there. | |
| <BillDueber> | Anyway, I gotta go get my kids. But all I can say is -- we store marc-xml and parse it out every time, and it's fast enough for us. | |
| ...and screw the rest of you :-) | ||
| <erikhatcher> | fyi - i'm now indexing (via solrmarc over http) the big talis open collection of marc. letting it run while i duck out for the evening. let's see how this puppy runs and then i'll work on optimizing it | |
| <BillDueber> | G'night all! | |
| <jrochkind> | :) | |
| BillDueber++ | 17:09 | |
| <erikhatcher> | BillDueber++ | |
| <BillDueber> | erikhatcher: We're waiting around here with baited breath for your input. It's been the blind leading the blind wrt solr indexing in both blacklight and vufind communities, I think. | |
| Wait. Baited? Bated? | ||
| either way. We're excited. | ||
| erikhatcher++ | ||
| ...and I'm gone. | ||
| <jrochkind> | erikhatcher: we love you. | 17:12 |
| * bess leaves | 18:11 | |
| * bess joins | ||
| * jamieorc leaves | 19:02 | |
| * bess leaves | 19:03 | |
| * ndushay leaves | 20:46 | |
| * ndushay joins | 21:32 | |
| * bess joins | 22:40 | |
| * bess leaves | 22:51 | |
| <erikhatcher> | ok, so update on my marc indexing.... | 23:12 |
| i grabbed the biggest file from the talis collection | ||
| ran trunk solr pointed to the solrmarc solr config | ||
| ran the indexer against | ||
| it | ||
| started it 6 hours ago | 23:13 | |
| went to a bar, heard a great band play | ||
| still running | ||
| hmmm | ||
| we can do better! | ||
| * bess joins | 23:34 | |
| * bess leaves | 23:42 | |
| <erikhatcher> | *still* indexing... i'm not sure how many records are in there | 00:35 |
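
One rough way to check the stored-MARC question raised in the log (Solr's QTime versus end-to-end throughput) is to time the same query with and without the stored field. The sketch below is only an illustration under stated assumptions, not what anyone in the channel ran. The helpers `select_url` and `measure` are hypothetical names, the base URL in the usage note is a guess at a local install, and `marc_display` is borrowed from erikhatcher's mention of his field name. Solr reports `QTime` in the `responseHeader` of a `wt=json` response. That figure may not cover the time spent reading stored fields and writing the response, so wall-clock time and response size are recorded alongside it.

```python
import json
import time
import urllib.parse
import urllib.request


def select_url(base, query, fields, rows=20):
    """Build a Solr /select URL that returns only the listed stored fields."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "json"}
    return base.rstrip("/") + "/select?" + urllib.parse.urlencode(params)


def http_fetch(url):
    """Fetch a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def measure(url, runs=50, fetch=http_fetch):
    """Return mean wall-clock ms, mean Solr QTime ms, and mean body size in bytes."""
    wall, qtime, size = 0.0, 0, 0
    for _ in range(runs):
        start = time.perf_counter()
        body = fetch(url)
        wall += (time.perf_counter() - start) * 1000.0
        # QTime is Solr's own reported query time, in milliseconds.
        qtime += json.loads(body)["responseHeader"]["QTime"]
        size += len(body)
    return {"wall_ms": wall / runs, "qtime_ms": qtime / runs, "bytes": size / runs}
```

Against a running Solr, calling `measure(select_url(base, "*:*", ["id"]))` and then `measure(select_url(base, "*:*", ["id", "marc_display"]))`, with `base` set to something like `http://localhost:8983/solr` (an assumed address), gives two sets of numbers to compare. This only benchmarks. As jrochkind notes in the log, finding where the time actually goes still takes profiling.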