Search Computing
Challenges and Directions

Stefano Ceri, Marco Brambilla (Eds.)

Springer LNCS, Vol. 5950, March 2010.

[SeCo Book Intro]
[Table of Contents]
[Book preview]

Appendix A

Search Computing Dictionary

by the SeCo Team

Politecnico di Milano, Dipartimento di Elettronica ed Informazione,
V. Ponzio 34/5, 20133 Milano, Italy

1. Service Framework

Service mart:
Data abstraction modeling → data sources referring to the same kind of “Web object” (e.g., hotels, movies, etc.). Service marts are described at three different levels: a higher level describing the → service mart signature and → connection patterns, an intermediate level describing all the → service interfaces available for the → service mart signature, and a lower level describing the → service implementation of each service interface.
Data source:
A ◊ data repository that can be accessed programmatically, such as a ◊ database, an ◊ application program, a ◊ Web Service, a ◊ wrapper extracting data from a ◊ Web site, etc.
Service mart signature:
A characterization of a → service mart in ◊ relational terms. It consists of the name of the service mart and the set of → attributes of the service mart.
Attributes of a service mart:
Attributes characterizing the different ◊ fields that are handled by the → service mart to describe properties of a Web object. Each attribute is associated with a → built-in type and possibly with a set of → keywords. The set of attributes of a service mart can be either single-valued or multi-valued.  Multi-valued attributes can be put together to form a → repeating group of attributes.
Repeating group of attributes:
A set of multi-valued attributes of a → service mart that collectively define one ◊ property of a Web object. Within the same Web object, the attributes of the same repeating group all have the same number of values.
Built-in type:
Any concrete type of a programming language supporting Search Computing, such as int, string, float, double, date, …
Keyword:
A term characterizing the semantics of an attribute. Keywords are normally taken from vocabularies, e.g. ◊ WordNet.
Service interface:
A characterization of one of the possible → service implementations of a → service mart given by one ◊ URI, one set of → Service Parameters, and one → access pattern. The same → access pattern is shared by possibly more than one service interface. Every service interface has exactly one service implementation.
Adornment:
A pair consisting of an → access pattern and a → scoring function associated to a → service interface. The same adornment is shared by possibly more than one service interface.
Adorned service mart:
The association of a → service mart with an → adornment existing in at least one → service interface of the service mart.
Service implementation:
A concrete ◊ implementation of one → service interface, thus providing operations for accessing data available at one or more → data sources according to the → access pattern of the service interface.
Access pattern:
An labeling of a → service mart signature that determines which attributes of the → service mart are ◊ output and which ones are ◊ input attributes. Among ◊ output attributes, some may be labeled as ranked and associated with a → score, and in that case the service implementation returns its results in an order which depends on them. There can be several → service interfaces for the same → access pattern.
Scoring function:
Function mapping each tuple output by a → service implementation into a → score. The mapping can be determined based on the → attributes of the service mart of the → service interface associated with the scoring function, or the position of the tuple in the output, and depends on the → ranking type. Note that if the → ranking type is unranked, then the scoring function is a fixed constant (e.g., 1).
Connection pattern:
Specification of the join between two → service marts, expressed as a conjunctive expression of join predicates using the attributes of the service marts.  Every join predicate is built by a pair of → service mart attributes, orderly taken from the first and second → service mart, and an arbitrary ◊ comparison operator.
Service mart design:
Process of defining the signature of a → service mart, then its →access patterns, then the  ◊data sources that can support queries expressed with those →access patterns, each of which is registered as a →service interface. The process is completed by defining connection patterns between pairs of service marts.
Service mart implementation:
Production of a → service implementation for a given → service interface. The process may use components and tools for data materialization, extraction, conversion, and translation.
Service registration:
Addition of a → service implementation and its → service interface to a → service mart. With the service registration, the service implementation is made available at the ◊ URI specified in the service interface.
Ranking type:
A set of properties regarding the ranking, i.e., the order in which the results of consecutive → fetches performed on a → service implementation are returned. The ranking type defines:

  • whether a → service implementation is ◊ ranked or ◊ unranked,
  • whether it is opaque or visible (i.e., the → attributes of the service mart include a → score),
  • the set of output attributes the ranking depends on (the set is empty if it does not depend on output attributes)
  • the → scoring function mapping the set of output attributes the ranking depends on into a → score, and
  • whether the ranking is ◊ ascending or ◊ descending.

Search Service:
Service implementation that returns → chunks of ranked results that are possibly ◊ unbounded in number. The corresponding ranking type is ◊ ranked.
Exact Service:
Service implementation whose corresponding ranking type is ◊ unranked. The results returned by such a service implementation are ◊ finite and not → chunked, except for technical reasons.
Service parameter:
Meta data describing the → service implementation of a → service mart, being one of → ranking type, cacheable, → cache time-to-live, → isChunked, → chunk size, → ERSPI, → decay factor, → response time, and → cost.
Fetch:
Request for the next → chunk of results from a → service implementation.
Chunked:
Property of a service implementation whose results are organized in → chunks.
Chunk:
Any page of results in the form of ◊ tuples returned by a → service implementation.
Cacheable:
Parameter of a → service interface. True if the results returned by a → service implementation can be stored in a ◊ cache; false otherwise.
Cache time-to-live:
Parameter of a → service interface. Duration of the validity of ◊ cached data.
isChunked:
Parameter of a → service interface. True if the → service implementation is → chunked; false otherwise.
Chunk size:
Parameter of a → service interface. The number of ◊ tuples in a → chunk.
ERSPI (expected result size per invocation):
Parameter of a → service interface. ◊ Positive real number expressing the expected size of the result produced by a → service implementation when called with arbitrary input values; in ◊ relational terms, expresses the ◊ average cardinality of the result of a query. The ERSPI is not very significant with search services, as the complete result may consist of a very high number of ◊ tuples, which however are not completely retrieved by fetch operations.
Selectivity:
Positive real number expressing the ratio between → ERSPI of a → service implementation and the total number of tuples that can be returned by the → service implementation.
Decay function:
Parameter of a → service interface. A function that models the decay of the → scores of the results returned by consecutive → fetches.
Response time:
Parameter of a → service interface. Average ◊ response time of a → fetch.
Cost:
Parameter of a → service interface. ◊ Monetary cost of a → fetch.
Score:
Value in the [0,1] interval indicating the quality (1=highest, 0=lowest) of a ◊ tuple according to a predefined criterion as specified by a → scoring function.

2. Query Expression and Optimization

Query on service marts:
A query is a ◊ graph whose nodes are labeled with → service marts and whose edges are labeled with → connection patterns. Nodes are additionally labeled with ◊ selection predicates over the → service mart attributes and ◊ projected attributes forming the result. The same → service mart can appear multiple times in the same query. Every query is interpreted as a ◊ conjunctive query, whose ◊ join predicates are built from →connection patterns.
Query on adorned service marts:
A query is a ◊ graph whose nodes are labeled with → adorned service marts and whose edges are labeled with → connection patterns. Additional labeling is as for →queries on service marts. A →query on service marts can be turned into a query on adorned service marts simply by substituting to every →service mart labeling a query node one of the corresponding → adorned service marts.
Feasible query:
A →query on adorned service marts is feasible if the →access patterns used in the query satisfy given well-formedness conditions guaranteeing that it is possible to compute the results as if access patterns were not present.
Query on service interfaces:
A →query on adorned service marts can be turned into a query on service interfaces simply by substituting to every → adorned service mart labeling a query node one of its →service interfaces. Recall that every → service interface is also associated with a → service implementation.
Query formulation:
The process of specifying → feasible queries on → service interfaces and possibly a → ranking function associated with the query. Suitable interfaces assist users in formulating only →feasible queries.
Query processing:
The process of executing → feasible queries on service interfaces, producing a → query result. The process consists of (1) query installation (2) query optimization (3) query execution (4) user interaction for interpreting results, possibly leading to further steps of query processing.
Query result:
A ◊ list of → combinations.
Combination:
A ◊ tuple of the → query result, built by the matching ◊tuples produced during → query execution, where a query result ◊ tuple includes exactly one ◊ tuple from the results produced by each → service implementations involved in the query execution. Combinations in the query result are ordered according to a → ranking function associated with the query.
Ranking function:
A ◊ weighted sum of the individual → scores of the ◊ tuples forming a → combination, which returns the ranking of the → combination according to the users’ preferences.
Query optimization:
The process of building query plans for → correct queries, and then selecting the → optimal query plan which is then executed.
Query plan:
A ◊ directed acyclic graph made of → nodes of a query plan and → edges of a query plan, representing an operational specification of the order in which → service interfaces are to be invoked in order to retrieve a specified number of → chunks, and of how → joins are to be performed between → chunks according to specific → join strategies.
Node of a query plan:
Any node in a → query plan, representing the → initial node, or the → final node, or an → invocation node, or a → selection node, or a → join node.
Edge of a query plan:
A directed edge in a → query plan, representing the data transfer between two → nodes of a query plan
Initial node:
A → node of a query plan with no incoming → edges and one or more outgoing → edges, used to denote the user input. It is the only node in the → query plan without incoming edges, and every → query plan must have exactly one initial node.
Final node:
A → node of a query plan with no outgoing → edges and one or more incoming → edges, used to denote the production of the query result. It is the only node in the query plan without outgoing edges, and every → query plan must have exactly one final node.
Invocation node:
A → node of a query plan that represents the invocation of a → service interface.
Selection node:
A → node of a query plan associated with a ◊ selection operation.
Join:
An operation between two → service interfaces, called join operands, that uses → connection patterns in order to form → combinations of the ◊ tuples returned as results of service invocations.Joins are classified as → pipe joins or as parallel joins, depending on the relationships between the → access patterns associated with the join operands.
Pipe Join:
A configuration of two invocation nodes A and B in the → query plan such that A and B are connected by a directed path from A to B, and one or more output attributes of A contain values which are used as input in B (i.e., attributes are labeled as input in the → access pattern of B).
Pipe node:
The destination node in the directed path of a →pipe join configuration.
Parallel Join:
A configuration of one join node and two antecedent nodes A and B in the → query plan such that A and B correspond to service interfaces and neither the output attributes of A contain values that are used as input in B nor the output attributes of B contain values that are used as input in A.
Join node:
A→ node of a query plan with exactly two incoming → edges of a query plan and one or more outgoing → edges of a query plan, used to denote a → parallel join.
Join strategy:
Defines the order in which → chunks, received at a join or pipe node, are to be considered in order to produce → combinations (i.e. join results). Different join strategies correspond to different orders in which the points in the → join search space are to be considered. Join execution strategies can use the → nested loop or → merge scan strategies and perform a → rectangular exploration or a → triangular exploration.
Nested Loop:
A → join strategy in which all → chunks from one service are retrieved (and possibly cached) before proceeding to retrieve the next → chunk from the other service.
Merge Scan: 
A → join strategy in which → chunks from the two services being joined are alternatively retrieved in a fixed alternated order. 
Rectangular Exploration:
An exploration of join combinations in which all available candidate combinations are considered as soon as new chunks are made available. The rectangular strategy can be applied to both → merge scan and nested loop.
Triangular Exploration:
An exploration of join combinations in which only the one half of the available candidate yielding to better rankings are considered as soon as new chunks are made available. The triangular strategy can be applied to both → merge scan and nested loop.  
Cost model:
A selection of the relevant → service parameters and mathematical relationships thereof for assessing the execution cost of a query, representing a specific objective to guide the optimization process.
Cost function:
A mathematical function of some chosen → service parameters whose result is a measure of the → cost of a query plan.
Cost of a query plan:
The result of applying a specific → cost function with a specific → cost model to a specific → query plan.
Optimal query plan:
The → query plan whose → cost is minimal among all possible plans for a given query.

3. Query Execution

Execution plan:
Directed graph consisting of nodes in the form of → scheduler units and edges in the form of either → data flow edges and → control flow edges. An execution plan represents the physical evaluation of a → query plan.
Scheduler unit:
A node of the → execution plan that has a well defined semantics in terms of an operation that it performs within the execution plan. → producer Unit and → consumer Unit.
Producer unit:
A scheduler unit that produces output data, i.e. a publisher. → service Invocation unit, → (parallel) join unit, → ranker unit, →  chunker unit, → selection unit,  and → cache unit.
Consumer unit:
A scheduler unit that consumes input data, i.e. a subscriber. → service invocation unit, → (parallel) join unit, →  ranker unit, chunker unit, → selection unit, and → cache unit.
Control flow edge:
An edge of the execution plan that transmits a control signal from one unit to another. Signals are of three kinds: → pulse signals, → suspend / resume signals.
Data flow edge:
An edge of the execution plan that transports data, → chunks of tuples, from a → producer unit to a → consumer unit.
Clock unit:
Controls n > 0 → service invocation units using → pulse signals which are produced at each → clock cycle. A clock unit is controlled by a → clock function and adjusted by → clock modifiers.
Pulse signal:
Pulse signals are emitted by the clock and trigger → service invocation units to fetch data. They contain the specification of the number of → fetches to be performed in each → clock cycle. A pulse signal can as well be produced by a data → producer unit when it first produces a tuple in output.
Clock frequency:
Parameter of the → clock unit that describes how many → clock cycles per time unit a clock has. Example: 1/50 s means 50 clock cycles per second.
Clock cycle:
The period during which the clock triggers the services according to current n-tuple of the → clock function. The number of clock cycles per time unit is determined by the → clock frequency.
Clock function:
A clock function is a regular expression that describes, for each → service invocation unit controlled by the clock and for each → clock cycle, the number of → fetches to be performed by the Service Invocation Unit. A regular expression for a clock function maps into a sequence of clock values given as n-tuples of ◊ integer numbers, where n is the number of Service Invocation Units controlled by the → clock unit. As an example, (1,1)(2,2)* represents a sequence in which two services are invoked once in the first clock cycle, and twice in any subsequent clock cycle. Each clock cycle of a clock function is expected to control the → join execution strategy in terms of → chunks produced.
Clock modifier:
A clock modifier for a → clock unit is an ◊ n-tuple of ◊ positive real numbers, where n is the number of → service invocation units controlled by that clock unit. During query execution, clock modifiers are used to adjust the number of → fetches for each Service Invocation Unit controlled by the clock unit.
Suspend/resume signals:
A suspend signal is sent by the → parallel join unit to the → clock unit once the → skew factor of the join unit has been reached; it is also sent when a given number k of results have been produced. On receipt of a suspend signal, the clock suspends execution only at the end of a clock cycle. As soon as the → join execution strategy realigns, a resume signal is sent to the clock to continue query execution. Suspend and resume signals can as well be produced by → end users, the former occur when they want to suspend the execution of an execution plan, e.g. because they are temporarily satisfied with the current results, the latter when they want to resume execution of an → execution plan.
Chunker unit:
A → scheduler unit that, based on the specification of a → chunking function, constructs new → chunks with the tuples it gets as input. Internally, the chunker unit breaks up the chunks in input and produces new chunks in output.
Chunking function:
A specification, given in the form of a regular expression, of how many tuples of the input are to be combined into every chunk in output. Example: 4^3, 3* means: first produce 3 chunks with 4 tuples, then continue with chunks of 3 tuples.
Service invocation unit:
A service invocation unit fetches data by using a specific → access pattern of a given service mart. Service invocation units can either invoke → search services or → exact services. Service invocation units are triggered by a → pulse signal or by the availability of input data. When a service invocation follows one or more service invocations in the execution plan, it implicitly computes a → pipe join.
Parallel Join Unit:
It joins the results of two search service calls; executing a join causes the production of result tuples, called → combinations, according to a → join execution strategy. Joins are associated with a → Skew value and can emit → suspend/resume Signals to the clock that controls the → service invocation units generating the data for the join unit.
Skew:
In the case of runtime service behavior diverging from the expected one, → parallel join units are allowed to deviate from their predefined → join execution strategy. The maximum permitted deviation is specified by the skew value. It states how many → chunks on the x or y axis can be processed, in addition to the planned ones, before the join unit sends a → pause signal to its clock. The signal stops the production of new chunks and permits the join unit to wait for already fetched chunks to propagate through the → query execution plan.
Cache unit:
Cache units store the output generated by → producer units and make it available to → Consumer Units. Therefore they serve as a common point of synchronization in the producer/consumer (or publish/subscribe) model. Inserts of data that is already cached have no effect. Caches may trigger their consumers whenever new data is available.
Ranker unit:
Ranks or re-ranks tuples in → chunks according to a → ranking function. It has a blocking behavior, which is controlled by the specification of the ranking function. More precisely, the unit re-ranks tuples up to a certain size or time limit, after which it outputs the ranked tuples and resets itself. It may therefore work as a synchronization point when receiving tuples from more than one unit.
Selection unit:
Performs selections (consisting of ◊Boolean expressions of ◊ selection predicates) over the dataflow ◊ tuples in input; only ◊ tuples satisfying the selection are produced in output.

4. Liquid Queries

Resource graph:
Graph showing to → expert users nodes labeled with → service marts or → service interfaces and arcs labeled with → connection patterns; → expert users build queries by means of graphic interfaces, with a ◊ mash-up style. Query formulation tools include facilities for selecting nodes and arcs, selecting result attributes, describing the rank function, and describing possible query augmentations.
Expert user:
Knows about resources and their semantics, assembles queries and defines → query interfaces for personal use but also for saving them and making them available to → end users.
End user:
Interacts with query by submitting values into query forms, possibly within slots which are well-defined in terms of semantics and with expected data types (for matching with given → attributes of service marts).
Liquid query:
Query whose formulation is flexible and fluid, with the purpose of discovering the user intent while submitting queries and reading results, in a more precise manner, through a step-by-step approach that allows the user to get closer and closer to the desired information. A liquid query consists of an initial → query interface, may support several → query augmentations, and produces a → liquid result.
Query expansion:
Possibility of adding to a query a simple sub-query, consisting of joining one of the → service interfaces already in the query to another → service interface, connected to it though a → connection pattern. The process is recursive and can be applied to the → result page or to a subset of the → result combinations shown in the → result page
Liquid result:
a set of results of a liquid query, which in turn is flexible thanks to a set of result interaction primitives that allow the end-user to modify the result structure and set of result instances, possibly shown partitioned in → result pages.
Query interface:
A ◊ user interface that is offered for formulating the initial query.  It consists of a set of → search service forms, which in turn are composed by → search fields.
Result instance:
Result tuple of a → liquid query, that complies with the corresponding → result structure.
Result page:
Set of → result instances that are shown together within the → liquid result interface.
Result structure:
The ◊ schema of the result of a liquid query. It consists of a set of typed attributes. See also → Attributes of a Service Mart.
Search fields:
Input fields that compose a → search service form.
Search service form:
Web form that allows user to submit parameters required by a search service in a → query interface.
Result interaction primitive:
User command enabled within → liquid results to refine, modify, extend the result itself (either in terms of → result structure or → result instances). Result interaction primitives are listed at the end of this dictionary and are marked with the symbol §.
§ Group:
Collecting in a group the results having common values for a specified attribute. The group assumes as a title the attribute value at hand. By applying this operation, all the clustering and sorting options possibly defined by the user are applied separately to each group.

§ Ungroup:
Removing the existing grouping.

§ Cluster:
Clustering adjacent tuples on a specific attribute and hiding duplicate valuesDefining a new grouping on the results.
§ Uncluster:
Removing a clustering on the results.
§ Sort:
Sorting the results currently being displayed according to one or more different attributes with respect to the ones initially defined. Sorting can be ascending or descending.
§ Unsort:
Removing a sorting on the results.
§ Drill-down:
Visualization of (already available but currently hidden) additional attributes .
§ Roll-up:
Hiding some attributes that are currently visible (and accordingly removing duplicates).
§ Filter:
Reducing the number of shown results based on a simple condition on one attribute.
§ DeleteInstance:
Locally deleting one instance from the currently displayed items. This can be seen as a particular case of filter operation.
§ RemoveFilter:
Canceling a filter currently applied.
§ More (all):
Asking for more results from the same query, by extracting results from every involved → service mart.
§ More of one service:
Asking for more results on a limited set of → service marts involved in the query.
§ Expand:
Asking for additional results (coming from other → service marts, connected through predefined → connection patterns) on a limited set of items listed in the current result set of the query.
§ Change ranking:
Ranking the results of the query according to a different ranking function wrt the one initially defined. This might produce a different set of results than those displayed before re-ranking.
§ Pivoting:
Clustering together all instances with the same multivalue attribute value or repeating group value, thereby rendering the other attributes as repeating groups.
§ ChangeProvider:
Changing the provider of a specific → service interface.

 

 

[SeCo Book Intro]
[Table of Contents]
[Book preview]

 
 
   
    My Home