java - Multiple queries on the DBpedia endpoint for movie information retrieval using Apache Jena
I'm trying to download film information (year of production and title) using Apache Jena, querying the DBpedia public endpoint. I know that the public endpoint has some security restrictions, and for this reason it doesn't allow a query to return more than 2000 rows in its result set. I've therefore tried to split the query into multiple queries using the limit and offset options, and with a Java program (http://ideone.com/xf0gce) I save the results to a specific file in a formatted manner:
public void movieQuery(String dbpediaFilms) throws IOException {
    // Prefix declarations shared by every page of the query.
    String includeNamespaces =
        "prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
        "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
        "prefix dcterms: <http://purl.org/dc/terms/>\n" +
        "prefix dbpedia-owl: <http://dbpedia.org/ontology/>\n";

    // Query text up to and including "offset "; the offset value is appended per page.
    String currQuery = includeNamespaces +
        "select distinct ?movie (str(?movie_title) as ?title) (str(?movie_year) as ?year) where {\n" +
        " ?movie rdf:type dbpedia-owl:Film .\n" +
        " ?movie rdfs:label ?movie_title .\n" +
        " ?movie dcterms:subject ?cat .\n" +
        " ?cat rdfs:label ?movie_year .\n" +
        " filter langMatches(lang(?movie_title), \"en\") .\n" +
        " filter regex(?movie_year, \"^[0-9]{4} \", \"i\")\n" +
        " } limit 2000 offset ";

    int totalNumberOfFilms = 77794;
    int totNumQuery = 39; // 77794 rows at 2000 rows per page
    int offset = 0;
    int currNum = 0;

    for (int i = 1; i <= totNumQuery; i++) {
        try {
            Query query = QueryFactory.create(currQuery + offset);
            currNum += Utils.serializeMappingList(getMovieMappingList(query), dbpediaFilms);
        } catch (Exception ex) {
            ex.printStackTrace();
            throw ex;
        }
        offset += 2000; // move on to the next page
        myWait(30);     // pause between requests to the public endpoint
    }
    System.out.println(currNum);
}
This is the query I use to retrieve the information I need:
select distinct ?movie (str(?movie_title) as ?title) (str(?movie_year) as ?year) where {
  ?movie rdf:type dbpedia-owl:Film .
  ?movie rdfs:label ?movie_title .
  ?movie dcterms:subject ?cat .
  ?cat rdfs:label ?movie_year .
  filter langMatches(lang(?movie_title), "en") .
  filter regex(?movie_year, "^[0-9]{4} ", "i")
}
limit 2000 offset $specific_offset
As you can see in the Java code, I increment a variable (offset) by 2000 on each iteration in order to fetch the correct partition of the result set.
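The paging arithmetic described above can be sketched in plain Java, without any Jena dependency. The class and method names (PagedQueries, pagesNeeded, pageQuery) are hypothetical and not part of the original program. The sketch also appends an order by clause, because SPARQL does not guarantee a stable row order without one, so limit/offset slices taken by separate requests may overlap or skip rows. Whether that accounts for any of the missing rows here is an assumption, not something the question establishes.

```java
public class PagedQueries {

    /** Number of pages needed to cover totalRows with pageSize rows per page (ceiling division). */
    public static int pagesNeeded(int totalRows, int pageSize) {
        return (totalRows + pageSize - 1) / pageSize;
    }

    /**
     * Appends "order by", "limit" and "offset" to a base query for a zero-based page index.
     * The base query is assumed (hypothetically) to end right after its closing brace.
     */
    public static String pageQuery(String baseQuery, String orderVar, int pageSize, int pageIndex) {
        long offset = (long) pageSize * pageIndex;
        return baseQuery
                + "\norder by " + orderVar
                + "\nlimit " + pageSize
                + "\noffset " + offset;
    }

    public static void main(String[] args) {
        // Illustrative base query only; the real query also selects titles and years.
        String base = "select distinct ?movie where { ?movie a <http://dbpedia.org/ontology/Film> }";
        int pages = pagesNeeded(77794, 2000);
        System.out.println("pages: " + pages);
        for (int page = 0; page < 2; page++) {
            System.out.println(pageQuery(base, "?movie", 2000, page));
            System.out.println("----");
        }
    }
}
```

A loop built this way could also stop as soon as a page comes back with fewer than 2000 rows, instead of relying on a hard-coded page count.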
Running a preliminary query, I've seen that the total number of distinct films in DBpedia is 77794, using this query:
select distinct count(?film) where { ?film rdf:type dbpedia-owl:Film . }
The problem is that when I count the number of nodes obtained, it comes to 76000, so I think I've missed a lot of films using this procedure. Can someone tell me how I can correctly get the whole result set? Am I forced to query a local DBpedia dump in order to get the results correctly?
Thanks a lot in advance.
EDIT: I've created a new query using the useful suggestion from @Joshua Taylor:
select distinct ?movie (str(?movie_year) as ?year) (str(?movie_title) as ?title) where {
  ?movie rdf:type dbpedia-owl:Film .
  ?movie rdfs:label ?movie_title .
  filter langMatches(lang(?movie_title), "en") .
  optional { ?movie dbpprop:released ?rel_year }
  optional { ?movie dbpedia-owl:releaseDate ?owl_year }
  optional { ?movie dcterms:subject ?sub .
             ?sub rdfs:label ?movie_year_sub
             filter regex(?movie_year_sub, ".*[0-9]{4}.*", "i") }
  bind(coalesce(?owl_year, ?rel_year, ?movie_year_sub) as ?movie_year)
}
group by ?movie
limit 2000 offset $specific_offset
Using the group by clause, the Virtuoso endpoint lets me get a correct result set that doesn't have duplicate rows. However, when I try to run the query using Apache Jena, I'm not able to execute it because I receive the following error:
com.hp.hpl.jena.query.QueryParseException: Non-group key variable in SELECT: ?movie_year in expression str(?movie_year)
There are more films than the ones that satisfy your original query, and your query doesn't count each movie exactly once. There's a big difference between select distinct (count(?var) as ?nvar) … and select (count(distinct ?var) as ?nvar) … . The first only shows you the distinct counts, whereas the second shows you the number of distinct bindings.
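The difference can be illustrated with ordinary Java collections standing in for query bindings. This is only an analogy under assumed data: the class name CountVersusDistinct and the film identifiers are made up for the example. Counting every binding mirrors count(?var), where a leading distinct merely de-duplicates the single row holding the count, while counting the unique values mirrors count(distinct ?var).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class CountVersusDistinct {

    /** Analogue of (count(?var) as ?n): every binding is counted, duplicates included. */
    public static int countAll(List<String> bindings) {
        return bindings.size();
    }

    /** Analogue of (count(distinct ?var) as ?n): each distinct value is counted once. */
    public static int countDistinct(List<String> bindings) {
        return new HashSet<>(bindings).size();
    }

    public static void main(String[] args) {
        // Hypothetical bindings: film:A appears twice, e.g. because a join duplicated it.
        List<String> bindings = Arrays.asList("film:A", "film:A", "film:B");
        System.out.println("count(?var)          = " + countAll(bindings));      // 3
        System.out.println("count(distinct ?var) = " + countDistinct(bindings)); // 2
    }
}
```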
You can get more than one result row for each movie. In this part of your query:
?movie rdf:type dbpedia-owl:Film .
?movie dcterms:subject ?cat .
?cat rdfs:label ?movie_year .
filter regex(?movie_year, "^[0-9]{4} ", "i")
you'll get a result row for each matching label of each category that the movie belongs to. E.g., if some film is in the categories 1984's worst movies and 2010 film remakes, you'll get two result rows.
There are also legitimate films that you won't be counting, because some films might not have an English movie title or a category that begins with a year.
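Both effects, extra rows and missing films, follow from how the filter interacts with the join on category labels. A small Java sketch with java.util.regex shows the counting. The class name CategoryRows and the category labels are hypothetical, and the sketch ignores the additional multiplication that several English titles would cause.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

public class CategoryRows {

    // Same pattern as the filter in the question; the "i" flag maps to CASE_INSENSITIVE.
    private static final Pattern YEAR_PREFIX =
            Pattern.compile("^[0-9]{4} ", Pattern.CASE_INSENSITIVE);

    /** Rows a film contributes to the result: one per category label that passes the filter. */
    public static int rowsFor(List<String> categoryLabels) {
        int rows = 0;
        for (String label : categoryLabels) {
            // find() looks for the pattern anywhere in the label; '^' anchors it to the start.
            if (YEAR_PREFIX.matcher(label).find()) {
                rows++;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical labels: two begin with a year and a space, so this film yields two rows.
        List<String> twoMatches = Arrays.asList("1984 drama films", "2010 film remakes", "Films set in Rome");
        // No label begins with a year, so this film yields no row and drops out of the results.
        List<String> noMatch = Collections.singletonList("Black-and-white films");
        System.out.println(rowsFor(twoMatches));
        System.out.println(rowsFor(noMatch));
    }
}
```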
I'm not sure whether you'll be able to get entirely satisfactory results, since it appears that DBpedia just doesn't reliably have the data that you want. That said, try a query like this to get started. It will get all films, and (hopefully) pull out enough information to get dates in many cases. Some of the dbpprop:released values are very strange though, and I don't know how useful they'll be to you.
select * where {
  ?film a dbpedia-owl:Film
  optional { ?film dbpprop:released ?released }
  optional { ?film dbpedia-owl:releaseDate ?releasedate }
  optional { ?film dcterms:subject [ rdfs:label ?catlabel ]
             filter( regex( ?catlabel, "^[0-9]{4}.*films", "i" ) ) }
}
order by ?film
limit 100
Update after the new query
The query you've posted that doesn't work with Jena (because it's not legal SPARQL, even though Virtuoso accepts it) can be fixed in a few different ways, depending on exactly what you want. The simplest, most direct way is simply not to group on anything.
select distinct ?movie (str(?movie_year) as ?year) (str(?movie_title) as ?title) where {
  ?movie rdf:type dbpedia-owl:Film .
  ?movie rdfs:label ?movie_title .
  filter langMatches(lang(?movie_title), 'en')
  optional { ?movie dbpprop:released ?rel_year }
  optional { ?movie dbpedia-owl:releaseDate ?owl_year }
  optional { ?movie dcterms:subject ?sub .
             ?sub rdfs:label ?movie_year_sub
             filter regex(?movie_year_sub, ".*[0-9]{4}.*", "i") }
  bind(coalesce(?owl_year, ?rel_year, ?movie_year_sub) as ?movie_year)
}
limit 2000
If you do that, though, you'll get multiple results when you have multiple English movie titles, release years, etc. If you want to avoid that, then you want to group by ?movie. Jena's right to reject things like
select ?movie (str(?movie_title) as ?title) where { ?movie :hastitle ?movie_title } group by ?movie
because str(?movie_title) doesn't make sense. For each ?movie, you've actually got a set of ?movie_title values. You need to get a representative title from that set. Now, it doesn't actually look like any movie has more than one English title. You can check with a query like:
select ?movie (count(?mtitle) as ?ntitles) where {
  ?movie a dbpedia-owl:Film ;
         rdfs:label ?mtitle .
  filter langMatches(lang(?mtitle), 'en')
}
group by ?movie
having (count(?mtitle) > 1)
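Given that, it means that you can safely group by ?movie ?movie_title, which will let you use ?movie_title in the projection variable list. But what to do about the release date? You could still end up with more than one of those, in principle. The data does give you more than one, in fact, as you can see with this query: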
select distinct ?movie (group_concat(?movie_year; separator=';') as ?years) where {
  ?movie rdf:type dbpedia-owl:Film .
  ?movie rdfs:label ?movie_title .
  filter langMatches(lang(?movie_title), 'en')
  optional { ?movie dbpprop:released ?rel_year }
  optional { ?movie dbpedia-owl:releaseDate ?owl_year }
  optional { ?movie dcterms:subject ?sub .
             ?sub rdfs:label ?movie_year_sub
             filter regex(?movie_year_sub, ".*[0-9]{4}.*", "i") }
  bind(coalesce(?owl_year, ?rel_year, ?movie_year_sub) as ?movie_year)
}
group by ?movie ?movie_title
having (count(?movie_year) > 1)
limit 2000
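This means that you'll need to get a single value based on that set. SPARQL gives you a few aggregate functions to do that (e.g., max, min, sum). In this case, I don't know if there's an easy way to pick out the "best" representative, so you might just want to sample from it, giving you a query like this: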
select distinct ?movie (str(sample(?movie_year)) as ?year) ?movie_title where {
  ?movie rdf:type dbpedia-owl:Film .
  ?movie rdfs:label ?movie_title .
  filter langMatches(lang(?movie_title), 'en')
  optional { ?movie dbpprop:released ?rel_year }
  optional { ?movie dbpedia-owl:releaseDate ?owl_year }
  optional { ?movie dcterms:subject ?sub .
             ?sub rdfs:label ?movie_year_sub
             filter regex(?movie_year_sub, ".*[0-9]{4}.*", "i") }
  bind(coalesce(?owl_year, ?rel_year, ?movie_year_sub) as ?movie_year)
}
group by ?movie ?movie_title
limit 2000
This is legal SPARQL, as confirmed by the sparql.org validator (once you provide some prefix definitions), so Jena should be fine with it, and Virtuoso (in this case, the DBpedia endpoint) accepts it, too.
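What grouping with sample and group_concat does to the rows can be mimicked in plain Java, which may help when checking the results client-side. This is a rough analogy under assumptions: the class GroupAndSample and the (movie, year) rows are hypothetical, and keeping the first value seen is just one possible choice, since sample in SPARQL is free to return any value from the group.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupAndSample {

    /** Groups (movie, year) rows by movie and keeps one year per movie (here: the first one seen). */
    public static Map<String, String> sampleYearPerMovie(List<String[]> rows) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String[] row : rows) {
            result.putIfAbsent(row[0], row[1]);
        }
        return result;
    }

    /** Analogue of group_concat(?movie_year; separator=';'): joins all years of a movie. */
    public static Map<String, String> concatYearsPerMovie(List<String[]> rows) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String[] row : rows) {
            // merge() stores the year if the movie is new, otherwise appends ";" + year.
            result.merge(row[0], row[1], (joined, year) -> joined + ";" + year);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical rows: film:A has two candidate years, film:B has one.
        List<String[]> rows = Arrays.asList(
                new String[] {"film:A", "1999"},
                new String[] {"film:A", "2000"},
                new String[] {"film:B", "1984"});
        System.out.println(sampleYearPerMovie(rows));
        System.out.println(concatYearsPerMovie(rows));
    }
}
```

If a deterministic choice matters, taking the minimum or maximum year per movie (the counterpart of min or max in the query) would be an alternative to sampling.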