XMP
xmp.py: Extract Adobe's XMP metadata from JPEG or PDF files and return it either as a string (in RDF/XML format) or loaded into an RDFLib TripleStore. There is also a function that takes a URI instead of a file name and retrieves the XMP data over the network. Based on some tricky regular expressions provided by Sean B. Palmer.
Link: Some Python Utilities
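The regular-expression idea mentioned in the description can be tried in isolation. The following is a minimal Python 3 sketch, not the script's own code: the name `find_xmp_rdf` is hypothetical, the data is assumed to be UTF-8 encoded, and the pattern is non-greedy, so it returns the first RDF block it finds. The script's own pattern is greedy, so the two can differ when a file carries more than one RDF block.

```python
import re
from typing import Optional

# Non-greedy match from the first "<rdf:RDF" to the nearest closing tag.
# re.DOTALL lets "." span the newlines inside the XMP packet.
_RDF_PATTERN = re.compile(rb"<rdf:RDF.*?</rdf:RDF>", re.DOTALL)


def find_xmp_rdf(data: bytes) -> Optional[str]:
    """Return the first RDF/XML block found in the raw bytes, or None.

    Hypothetical helper for illustration; assumes the packet is UTF-8.
    """
    match = _RDF_PATTERN.search(data)
    if match is None:
        return None
    # Undecodable bytes are replaced rather than raising an error.
    return match.group(0).decode("utf-8", errors="replace")
```

Calling `find_xmp_rdf(open("photo.jpg", "rb").read())` on a file of your own (the file name here is a placeholder) should return the RDF/XML text when the file carries an XMP packet, and `None` otherwise, instead of failing an assertion.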
#!/usr/bin/env python
"""
Extract XMP metadata from JPEG and PDF files. The trick to extract the RDF content is based on
using a regular expression on the whole file (seen as a string). The idea of using a regexp that
way for the purpose of XMP extraction came from U{Sean B. Palmer<http://purl.org/net/sbp/>} and
U{Dan Brickley<mailto:danbri@w3.org>}.

Testing the file type is crude. A more sophisticated file type test or, preferably, a more general
library that extracts the XMP content from all types of files would be much better at some point...

@author: Ivan Herman
"""

debug = True

import re, os, imghdr

def _searchXMLContent(b) :
    """
    Extract the XMP content from the byte stream, using a regular expression search
    @param b: byte stream of the image content
    @return: RDF data as a string
    @rtype: string
    """
    rdfpat = r"(?sm)^.*(<rdf:RDF.*</rdf:RDF>)"
    r_rdf  = re.compile(rdfpat)
    q      = r_rdf.search(b)
    assert q != None, "Could not find the XMP content in the file"
    return q.group(1)

def _testFile(fname) :
    """Test whether the file is of a format that can have XMP information. The test is based
    on the imghdr library of python. However, that library is not foolproof: unfortunately, I have
    hit JPG files that are not recognized by imghdr while understood by all the usual image programs
    (I wonder whether this is related to Photoshop CS2, I have not seen such problems before).
    Consequently, if the imghdr test fails, the suffix of the file is also considered and the
    following suffixes are also accepted as candidates: 'jpg', 'jpeg', 'JPG', 'JPEG', 'pdf', 'PDF'.
    I realize this is not really kosher, but I am not in the mood to debug imghdr (besides, pdf
    files are not considered at all by that one...)

    @param fname: the filename for the image or the pdf file
    @return: whether the file is of a proper format
    @rtype: Boolean
    """
    suffixes = ['jpg','jpeg','pdf']
    hdr = imghdr.what(fname)
    if hdr == None or hdr != 'jpeg' :
        # fall back on the (lowercased) file suffix
        for sfx in suffixes :
            if fname.lower().endswith(sfx) :
                return True
        # if we got here then, unfortunately, this is not a valid file type
        return False
    else:
        return True

def extractXMPFromURI(uri) :
    """Extract the XML RDF data for PDF and JPG files based on a URI, and return it as a string
    @param uri: the URI for the image
    @type uri: string
    @return: RDF data as a string
    @rtype: string
    """
    import urllib2
    obj   = urllib2.urlopen(uri)
    inf   = obj.info()
    ftype = inf["content-type"].split(";")[0].strip()
    if ftype == "image/jpeg" or ftype == "application/pdf" :
        length = int(inf["content-length"])
        p = obj.read(length)
        if debug :
            # keep a local copy of the downloaded content for inspection
            t = open("test.jpg","wb")
            t.write(p)
            t.flush()
            t.close()
        return _searchXMLContent(p)
    else :
        # a real exception class: string exceptions are rejected by Python 2.6 and later
        raise ValueError("Cannot manage this file type: %s" % ftype)

def extractXMP(fname) :
    """Extract the XMP RDF data for PDF and JPG files and return it as a string
    @param fname: the filename for the image or the pdf file
    @type fname: string
    @return: RDF data as a string
    @rtype: string
    """
    if _testFile(fname) :
        f = open(fname,'rb')
        return _searchXMLContent(f.read())
    else :
        # a real exception class: string exceptions are rejected by Python 2.6 and later
        raise ValueError("Unknown FileType: Cannot manage this file type")

def extractXMPTriples(fname,triples = None) :
    """Extract the XMP RDF data for PDF and JPG files and return an RDFLib triple store. If the
    triples parameter is not None, then it is considered to be an already existing triple store
    that has to be extended by the new set of triples.

    @param fname: the filename for the image or the pdf file
    @type fname: a string, denoting the file name
    @param triples: an RDFLib TripleStore (default: None)
    @type triples: rdflib.TripleStore
    @return: triples
    @rtype: RDFLib TripleStore
    """
    rdf = extractXMP(fname)
    # If it is at that point, no exception has been raised, ie, the rdf content exists
    #
    # Note the import here, not in the header; the module should be usable without RDFLib, too...
    from rdflib.TripleStore import TripleStore
    if triples == None :
        triples = TripleStore()
    # The logical thing to do would be to wrap rdf into a StringIO and load it into the triples
    # but that does not work with TripleStore, which expects a file name (or I have not found other means)
    # Ie, the rdf content should be stored in a temporary file.
    tempfile = fname + "__.rdf"
    store = file(tempfile,"w")
    store.write(rdf)
    store.flush()
    store.close()
    triples.load(tempfile)
    try :
        # on Windows this may not work...
        os.remove(tempfile)
    except :
        # This works only when called from cygwin. However, is there anybody in his/her able mind
        # using Windows and python without some sort of a unix emulation ;-)
        os.system("rm %s" % tempfile)
    return triples

####################################################################################################
if __name__ == '__main__' :
    import sys
    if len(sys.argv) > 1 :
        fname = sys.argv[1]
    else :
        fname = "http://www.ivan-herman.net/Photos/xmptest.jpg"
    triples = extractXMPFromURI(fname)
    print triples
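The `_testFile` docstring admits that the `imghdr` check is unreliable and ignores PDF files altogether. One possible alternative, shown here only as a hedged Python 3 sketch and not as the author's method, is to look at the leading bytes of the file directly. The function name `sniff_xmp_container` is hypothetical. The sketch assumes that a JPEG stream opens with the bytes `FF D8 FF` and that a PDF file opens with `%PDF-`; files with extra bytes before the PDF header would be rejected by this simple version.

```python
from typing import Optional


def sniff_xmp_container(path: str) -> Optional[str]:
    """Return 'jpeg', 'pdf', or None, judging by the first bytes of the file.

    Hypothetical helper for illustration; it does not validate the rest
    of the file and does not look at the file suffix.
    """
    with open(path, "rb") as handle:
        head = handle.read(8)
    # JPEG: start-of-image marker (FF D8) followed by the next marker's FF.
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    # PDF: the header line begins with "%PDF-" followed by the version.
    if head.startswith(b"%PDF-"):
        return "pdf"
    return None
```

A check of this kind does not depend on `imghdr`, which was deprecated in Python 3.11 and removed in Python 3.13, and it treats JPEG and PDF files the same way.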
Discussion