Discover Top Posts Tagged with #query answering

Things you can't do in first-order logic

When it comes to first-order logic, you can't always get what you want--especially if what you want is to answer higher level queries over a database.

But if you try sometimes well you might find etc.

(by the way, I'll be drawing heavily on the first chapter of Libkin [2012])

\[\ast\ast\ast\]

There's a small piece of the universe you're interested in and you have thorough knowledge on, the railway network (because you're German).

So you set up a database where you represent the connections in the railway network. You think about what sort of language to use to represent this knowledge, and go for first-order logic (because you're German).

The stations are: \(\mathsf{berlin}\), \(\mathsf{dortmund}\), \(\mathsf{hamburg}\), \(\dots\), and the relevant relations: \(\mathsf{Station}(x)\), \(\mathsf{Connection}(x,y)\), and that seems enough so let's keep to this vocabulary.

With sturdy optimism you now fill out the database with facts:

\[\mathsf{Station(berlin)}. \\\ \mathsf{Station(dortmund)}. \\\ \mathsf{Station(hamburg)}. \\\ \mathsf{Connection(berlin, hamburg)}.\\\ \dots\]

\[\ast\ast\ast\]

Very nice, but now you want to retrieve information from this database and allow others to retrieve information from it (because).

You set up some basic mechanism for query answering. Since you used first-order logic to represent the domain (the railway network), it seems natural that queries should also be first-order logic formulas and that they should be evaluated in the most intuitive way, according to the usual semantics.

For instance, when you ask

\[\mathsf{Station(berlin)} \wedge \mathsf{Station(hamburg)} \wedge \mathsf{Connection(berlin, hamburg)}\]

you'll get the answer \(\mathsf{yes}\). When you ask

\[\mathsf{Connection(berlin, dortmund)}\]

you'll get the answer \(\mathsf{no}\). And so on.

Which isn't bad, but you want to market your contraption. So the public needs to be able to ask more interesting queries--not just \(\mathsf{yes}\) or \(\mathsf{no}\) queries, but stuff that asks for more detailed information. For instance:

Where can I go from Berlin?

What connections are there from Berlin to Munich?

What is the shortest connection between Berlin and Munich?

That's ambitious. The first step is to allow your queries to admit sets of objects as answers: not only if there are connections from Berlin but also which are they.

This you do by putting free variables in your queries, for instance

\[q(y) = \mathsf{Connection(berlin, y)}\]

The left-hand side, \(q(y)\), showcases the free variable, which in this case is \(y\). The query reads, roughly:

"the y's such that Berlin is connected to y",

or

"stuff that Berlin is directly connected to".

The answer to \(q\) would be something like:

\[\{\mathsf{hamburg}, \dots\}\]

\[\ast\ast\ast\]

If you want to find pairs of stations where there's a direct connection from the first to the second, you write the query:

\[q_0(x,y) = \mathsf{Connected}(x,y)\]

Here \(x\) and \(y\) are the free variables, and the answer will be something like:

\[\{(\mathsf{berlin}, \mathsf{hamburg}), \dots\}\]

If you want to find stations that have exactly one intermediate stop between them, you write:

\[q_1(x,y) = \exists y_1 \mathsf{Connection}(x, y_1) \wedge \mathsf{Connection}(y_1, y) \wedge (x \neq y_1) \wedge (y \neq y_1)\]

If you want to find stations that have two intermediate stops you write:

\[q_2(x, y) = \exists y_1 \exists y_2 (\mathsf{Connection}(x, y_1) \wedge \mathsf{Connection}(y_1, y_2) \wedge \mathsf{Connection}(y_2, y) \wedge \\\ \bigwedge ((x \neq y_i) \wedge (y_i \neq y_j) \wedge (y_j \neq y))) \]

And so on.

If you want to find stations that are connected either directly or by an intermediate stop, you write

\[q_{\leq 1} = q_0(x, y) \vee q_1(x,y)\]

If you want to find stations that have 2 or less intermediate stops you write

\[q_{\leq 2} = q_{\leq 1} \vee q_2\]

And so on. In general, if you want to find stations that have \(k\) or less intermediate stops, you write:

\[q_{\leq k} = q_{\leq (k-1)} \vee q_k\]

(notice the recursive step)

\[\ast\ast\ast\]

But what if you just want to find a generic connection, i.e. pairs that are connected somehow, with some number of intermediate stops between them (not a specific one)? What do you do then?

This is where things get hairy. Because what you'd want to ask is something like the following query:

\[\bigvee_{i \in \mathbb{N}} q_i\]

This is an infinite disjunction, so not really first-order logic material. And, it turns out, this query isn't expressible as a first order logic formula at all (I'll show a proof later).

\[\ast\ast\ast\]

Suppose you want to know if, among all the connection chains you can form, there is one which starts and ends with the same city. In other words, if you represent the \(\mathsf{Connection}\) relation as a graph, is there a first-order formula which gives you the nodes connected by a cycle?

Turns out this isn't expressible either.

\[\ast\ast\ast\]

Take \(|y|\) to be the number of stations connected to \(y\), i.e.

\[|y| = |\{x \mid \mathsf{Connected(x, y)}\}|\]

And suppose you want to find those pairs \((y_1, y_2)\) such that \(|y_1| = |y_2|\). Can you write a first order query that expresses this?

Turns out no.

\[\ast\ast\ast\]

"This is bullshit."

First-order logic has these limitations on what you can express in it. You can't express recursion, you can't encode cardinality of sets--all kinds of neat things you have in, say, second-order logic.

"Why does this matter?"

Databases are important in today's society. A lot depends on information being available where you put it, fast and in whatever form needed.

"Boring."

It's a multi-billion dollar industry.

"Tell me more."

Query languages such as SQL (which at their core are variants of (a fragment of) first-order logic) implement special features for the kinds of things that are of practical interest, e.g. counting--you can ask for the number of individuals satisfying a certain property, and the SQL engine on your server will give it to you.

The interesting questions for database people are: what can you add (and what do you need to add) to a basic language like first-order logic (or some variant/fragment of it) to make it more expressive?

This is more complicated than it seems, because expressiveness usually has to be balanced out with computational complexity. Roughly put, the more expressive a language is the more complex it gets to perform operations in it.

For instance, one basic operation is query answering--given a query, to compute the answer. You want your database system to be able to do at least that, and to do it well (i.e. fast). No use in having a very expressive language if your database management system takes a million years to answer interesting queries in that language.

In the sections above I kind of glossed over this issue, but you can imagine that in a real world scenario query answering requires looking through the database and doing some checking. And especially if your database is large, this operation has the potential of getting very very consuming.

So for real applications what you need to do is balance expressiveness with complexity of performing interesting tasks. You need to understand trade offs that can appear.

\[\ast\ast\ast\]

"Why are you telling me all this?"

I want to show you the proof that the generic connection property for stations (or, more accurately: the property that there's a path of some length \(k \in \mathbf{N}\) between two nodes of a graph), that this property isn't expressible as a first-order formula.

#query answering

Trending Tags

Recently Viewed Tags

#query answering