Reverse search optimization

QED

Diamond Member
Dec 16, 2005
3,428
3
0
I've been tasked with a somewhat interesting project.

A customer (online retailer) of ours has hundreds of thousands of users. Each user can search for products by specifying a search term that may include wildcards. So far, trivially easy.

Now each user can save their search terms so they can run any particular search they want again in the future. Again, trivially easy.

Now-- the interesting part: the customer wants users to be notified instantaneously whenever a new product arrives that matches their saved search parameters.

Thousands of new products arrive each day... is there some better approach than a simple brute force attack of iterating through and executing each and every one the hundreds of thousands of saved search terms against each and every product as it arrives? If so, this would be murder on the databases and murder on the processors.

Perhaps break each user's saved search criteria into individual words that can be indexed, and then break each product's description/name into words as well and only execute saved searchs where at least one of the individual words is found in both the product description and the saved search criteria? This should work as long as the customer isn't expecting the wildcard to complete an actual word... for instance, if the customer search is "pizza*" the above approach would match a product name of "pizza oven" but not "pizzas"-- which it probably should.

Any ideas for optimizing the brute force approach which requires every search criteria to be evaluated against every new product?
 

Markbnj

Elite Member <br>Moderator Emeritus
Moderator
Sep 16, 2005
15,682
13
81
www.markbetz.net
I would definitely be tempted to do this in a batch process, but if they really mean "instantaneously" then that's a different animal.

One idea that pops to mind: when a user executes a search store the count of returned items in a column with the saved search terms. Have another version of the search that returns just the count. That should save some overhead. Every so often kick off a process that looks to see if any new products have been added. If so, run through the user saved searches and run the count version for each. If the number comes back different write a record somewhere for further processing.

When you're done you'll have a list of searches for which the results have changed since the new products were added. Now you'd need a second step to determine whether the count was different because one of the new products showed up on the list, or an existing product was removed.
 

Oyster

Member
Nov 20, 2008
151
0
0
Depending on your design, this may or may not work: you could partition your data? Compared to what Mark suggested, this may seem trivial, but I can't really see past this being a data partitioning task.

You may want to think your DB design thoroughly before you implement it, but if you do it right, your searches would be super-fast. For example, you could normalize (I'm no expert in this area - I'm just thinking out loud) the DB schema so you store all your A* products in one partition, all your B* products in another, and so on. Then, reference your searches in these products. When a new product is inserted, you're only going to notify the search queries that were referenced, without really impacting the rest of the DB/table. Of course, you'll have to properly index the DB, but as long as you keep it simple, I think you should be OK.


I am just thinking about this as I type, so please take it with a grain of salt. Also, I am assuming that these are simple "word" searches, not involving fuzzy matching and lookups? Because if you're, you're in for a treat .

It'd be interesting to see what others come up with.
 

KIAman

Diamond Member
Mar 7, 2001
3,342
23
81
You can flag dumb queries and never run them. Dumb queries being ones that return everything or nothing.

You can also query the table that holds all the user search terms per incoming product and only run the terms that return. This will be very fast if a product is very unique compared to other products.
 

iCyborg

Golden Member
Aug 8, 2008
1,327
52
91
Have you thought about building an inverted index for saved searches? (it has some resemblance to what you're describing)

That would be my approach if we're talking about large enough data sets, e.g. Google uses them I think.
 

Cogman

Lifer
Sep 19, 2000
10,278
126
106
Can "Instantaneously" mean "1->10 minutes later"? The brute force attack could be made vastly more efficient if it isn't ran every time a new item is submitted (It may even be faster). By just running a service every 1->10 minutes you get semi-instantaneous results. (throwing some value to signal that a new item has been added to avoid oversearching).

Going backwards just seams like a pain in the butt, not something that DBs are generally made to do. (though, the algorithm would be pretty similar to forward searching).

Another idea might be to do something like this. When someone adds a new term, search through existing search terms to see if this search term finds them. Create a tree like structure and iterate down. In other words, if search a finds search b and a new item doesn't match search a, then you know you don't need to run search b (or something like that).

I don't know if I explained that well enough. Here is what I envision. Lets say you have the search terms

piz*
pizza*
pizza oven*
pirate

piz* will match pizza* and pizza oven* If a new item comes in named pasta, run it first against piz*, if it doesn't match, you know that pizza* and pizza oven* can't match it, so don't do those searches...

I don't know, but then that opens up the problem for searches like *pizza*oven* that I don't know if you could easily resolve... Just thinking outloud here.
 

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
It seems to me you just need another search. (database)

You currently have a keyword to item search you need a keyword to user search. You would basically be producing an index that produces a list of users based on keyword usage. So for every keyword used there needs to be a list of users for that keyword. When a new item arrives you assign it a list of keywords, run it through the keyword to user index and flag every user on that list as dirty.
 

ioni

Senior member
Aug 3, 2009
619
11
81
Store the date each product is entered. Store the date of the last time each search query was run. Only search on products each query sees as new. Combined with cogman's idea, I don't think that would hit the DB too hard.

Also, this sounds like it should be an opt in option per search. I know I wouldn't want to get my mailbox hit with all that spam.
 

BoberFett

Lifer
Oct 9, 1999
37,563
9
81
It seems to me you just need another search. (database)

You currently have a keyword to item search you need a keyword to user search. You would basically be producing an index that produces a list of users based on keyword usage. So for every keyword used there needs to be a list of users for that keyword. When a new item arrives you assign it a list of keywords, run it through the keyword to user index and flag every user on that list as dirty.

This.
 

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
I'd like to add why it would be beneficial to rerun all the searches at the user level. (I.e. make the users dirty) A keyword to user search can really only have positive items (i.e. pizza) but user's searches can have negative items. (i.e. -oven) If you added an oven you wouldn't want to run a search on all the users that have -oven in their search. You could quantify whether each user receives an update when an item is added, but as you said there are 1000s of items added at a certain time. If that is the case users could receive multiple updates one for each item that meats their search criteria as they're added. It would be best to add all the new items while creating a dirty user list, then run through the list and send out the delta in one update notification.
 
sale-70-410-exam    | Exam-200-125-pdf    | we-sale-70-410-exam    | hot-sale-70-410-exam    | Latest-exam-700-603-Dumps    | Dumps-98-363-exams-date    | Certs-200-125-date    | Dumps-300-075-exams-date    | hot-sale-book-C8010-726-book    | Hot-Sale-200-310-Exam    | Exam-Description-200-310-dumps?    | hot-sale-book-200-125-book    | Latest-Updated-300-209-Exam    | Dumps-210-260-exams-date    | Download-200-125-Exam-PDF    | Exam-Description-300-101-dumps    | Certs-300-101-date    | Hot-Sale-300-075-Exam    | Latest-exam-200-125-Dumps    | Exam-Description-200-125-dumps    | Latest-Updated-300-075-Exam    | hot-sale-book-210-260-book    | Dumps-200-901-exams-date    | Certs-200-901-date    | Latest-exam-1Z0-062-Dumps    | Hot-Sale-1Z0-062-Exam    | Certs-CSSLP-date    | 100%-Pass-70-383-Exams    | Latest-JN0-360-real-exam-questions    | 100%-Pass-4A0-100-Real-Exam-Questions    | Dumps-300-135-exams-date    | Passed-200-105-Tech-Exams    | Latest-Updated-200-310-Exam    | Download-300-070-Exam-PDF    | Hot-Sale-JN0-360-Exam    | 100%-Pass-JN0-360-Exams    | 100%-Pass-JN0-360-Real-Exam-Questions    | Dumps-JN0-360-exams-date    | Exam-Description-1Z0-876-dumps    | Latest-exam-1Z0-876-Dumps    | Dumps-HPE0-Y53-exams-date    | 2017-Latest-HPE0-Y53-Exam    | 100%-Pass-HPE0-Y53-Real-Exam-Questions    | Pass-4A0-100-Exam    | Latest-4A0-100-Questions    | Dumps-98-365-exams-date    | 2017-Latest-98-365-Exam    | 100%-Pass-VCS-254-Exams    | 2017-Latest-VCS-273-Exam    | Dumps-200-355-exams-date    | 2017-Latest-300-320-Exam    | Pass-300-101-Exam    | 100%-Pass-300-115-Exams    |
http://www.portvapes.co.uk/    | http://www.portvapes.co.uk/    |