Kig Metadata

File metadata extraction tool and Ruby library

Thanks

Konrad Meyer for his patient testing and bug reports. Darren Kirby for the heads-up on wmainfo's ASF-parsing capabilities (along with being the author of wmainfo-rb and flacinfo-rb.)

Description

This package, `Metadata', comes with a library called `metadata' and a small program called `mdh'.

The library probes files for their metadata (e.g. jpeg dimensions and camera make, mp3 artist, pdf text and word count) and returns the metadata as a Hash. All strings in the metadata are converted to UTF-8.

The `mdh'-program can print out file metadata as YAML and package the metadata with the file.

The metadata hash follows the shared file metadata spec naming, with some additional fields; see the list at the end of this file (Appendix A.)

For details on the MDH file format, see the end of this file (Appendix B.)

Usage

print out metadata for myfile.jpg

mdh myfile.jpg

create myfile.jpg.mdh, which consists of an MDH metadata header + myfile.jpg

mdh -c myfile.jpg

print out the metadata header from an MDH file

mdh -e -p myfile.jpg.mdh

strip out the metadata header from an MDH file and write the actual file to myfile.jpg

mdh -e myfile.jpg.mdh

include file path, filename, md5sum and sha1sum in the metadata header

mdh --path --name -m -s myfile.jpg

guess title for document (first line that starts with a capital letter)

mdh --guess-title foo.ps

guess title, abstract and metadata for document

mdh --guess-metadata foo.ps

don't include document text (File.Content) in the metadata

mdh --no-text foo.ps

query CiteSeer with the document title, add possible results to metadata

mdh --citeseer foo.ps

query DBLP with the document title, add possible results to metadata

mdh --dblp foo.ps

If you have an unknown CS document, this might help identify it:

mdh --guess-metadata --dblp --citeseer foo.ps

print out the list of options

mdh -h

irb> require 'metadata'
irb> Metadata.extract('myfile.jpg')
irb> Metadata.extract_text('myfile.pdf')
irb> Pathname.new("myfile.jpg").metadata
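
Since the metadata comes back as a Hash with dotted field names (see Appendix A), it can be post-processed with ordinary Ruby. The following is only a sketch: it assumes the keys are strings such as "Image.Width", `group_by_namespace' is a hypothetical helper that is not part of the library, and the sample values are made up for illustration.

```ruby
# Hypothetical helper: split dotted field names ("Image.Width") into
# a nested hash keyed by namespace ("Image") and field ("Width").
def group_by_namespace(metadata)
  grouped = Hash.new { |hash, key| hash[key] = {} }
  metadata.each do |name, value|
    # to_s so the helper also copes with symbol keys (an assumption).
    namespace, field = name.to_s.split(".", 2)
    grouped[namespace][field] = value
  end
  grouped
end

# Made-up sample standing in for the result of Metadata.extract.
sample = {
  "File.Format"  => "image/jpeg",
  "Image.Width"  => 640.0,
  "Image.Height" => 480.0
}

group_by_namespace(sample).each do |namespace, fields|
  puts "#{namespace}: #{fields.keys.join(', ')}"
end
```

Grouping by namespace makes it easy to pick out, say, only the Image.* fields of a file without knowing in advance which ones the extractor managed to fill in.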

List of supported formats

Audio: Whatever you manage to make mplayer play. Plus special handlers for FLAC, m4a, ape, musepack, wavpack and wma.

Successfully tested with:
  mp3, flac, ogg, wav, ra, m4a, wma

Should also work:
  wv, mpc, ape

Video: Whatever you manage to make mplayer play.

Successfully tested with:
  wmv, mov, divx, xvid, flv, ogg, mpg, mkv

Images: Should handle pretty much anything. I.e. anything handled by ExifTool, ImageMagick, Imlib2 or dcraw.

Successfully tested with:
  Web formats:
    jpeg, png, gif, svg
  Camera raws:
    nef, dng, crw, pef, orf, arw, raf, cr2
  Image editor state dumps:
    psd, xcf
  The rest:
    tga, tif, bmp, xpm, ppm, pcx

Documents:

Successfully tested with:
  Web formats:
    html, txt
  Print formats:
    pdf, ps, ps.gz
  OO formats:
    sxi, odp
  MS formats:
    doc, ppt, xls

- I'm using unoconv to convert OO & MS docs to temp PDFs for the text &
  dimensions extraction, so those bits of data are missing. MSOffice docs
  are missing dimensions for the same reason. Here's a way to get them:
  ( first, get Thumbnailer: http://github.com/kig/thumbnailer/tree/master )
  $ thumbnailer -s 1 -k foo.odp /tmp/foo.jpg
  $ mdh foo.odp
  $ rm foo.odp-temp.pdf /tmp/foo.jpg

Others:
  - BitTorrent .torrent files
  - Archive contents: tar.gz, zip
  - Whatever `extract' outputs and I am handling

Formats that yield very little metadata: ai

Formats that don't yield usable metadata: chm, sis, rb, rar, ttf

Formats that fail mimetype guessing: exr

Requirements

  • Ruby 1.8

  • Tons of metadata extraction programs and libs. This package has many dependencies since there is no single universal metadata header format that all files use. Blame resource forks, filename extensions, bags of bytes and mimetypes.

    List of gems: flacinfo-rb wmainfo-rb MP4Info id3lib-ruby apetag text hpricot ruby-mp3info

    List of Debian packages: dcraw libimlib2-ruby extract libimage-exiftool-perl poppler-utils mplayer html2text imagemagick unhtml pstotext antiword catdoc shared-mime-info

  • You do want to install the latest versions of dcraw and shared-mime-info to be able to handle camera raw images. http://cybercom.net/~dcoffin/dcraw/ http://freedesktop.org/wiki/Software/shared-mime-info

  • Python + chardet library http://chardet.feedparser.org/

Install

De-compress archive and enter its top directory. Then type:

($ su)
# ruby setup.rb

This simple step installs the program under the default location of Ruby libraries. You can also install files into your favorite directory by supplying setup.rb some options. Try "ruby setup.rb --help".

Appendix A: Metadata fields

This list contains the metadata fields output by Metadata and mdh. The list follows the shared file metadata spec for the most part. http://wiki.freedesktop.org/wiki/Specifications/shared-filemetadata-spec

field name | field type

Archive.Contents          array of pathnames

Audio.Band                string
Audio.Composer            string
Audio.Conductor           string
Audio.Copyright           string (copyright message)
Audio.Grouping            string
Audio.Image               base64-encoded binary string (embedded image data)
Audio.InterpretedBy       string
Audio.Lyricist            string
Audio.Publisher           string
Audio.RemixedBy           string
Audio.Subtitle            string
Audio.Tempo               integer
Audio.VariableBitrate     boolean
Audio.Writer              string
Audio.Publicationright    string
Audio.File                string
Audio.EAN/UPC             string
Audio.ISBN                string
Audio.Catalog             string
Audio.LC                  string
Audio.Media               string
Audio.Index               string
Audio.Related             string
Audio.ISRC                string
Audio.Abstract            string
Audio.Language            string
Audio.Bibliography        string
Audio.Introplay           string
Audio.Dummy               string
Audio.DebutAlbum          string
Audio.RecordDate          string
Audio.RecordLocation      string
v-- ORIGINAL FIELDS USED --v
Audio.Title               string
Audio.Artist              string
Audio.Album               string
Audio.AlbumArtist         string
Audio.AlbumTrackCount     integer
Audio.TrackNo             integer
Audio.DiscNo              integer
Audio.Performer           string
Audio.Duration            float
Audio.ReleaseDate         datetime
Audio.Comment             string
Audio.Genre               string
Audio.Codec               string
Audio.Samplerate          integer
Audio.Bitrate             float
Audio.Channels            integer
Audio.Lyrics              string

Doc.Album                 string
Doc.Artist                string
Doc.Charset               string
Doc.Description           string
Doc.Genre                 string
Doc.Language              string
Doc.ModifyDate            date
Doc.PageSizeName          string (A4, A5, letter, ...)
Doc.RevisionHistory       array of strings
Doc.ParagraphCount        integer
Doc.LineCount             integer
Doc.CharacterCount        integer
Doc.LastSavedBy           string
Doc.Keywords              array of strings
Doc.Template              string
Doc.Publisher             string
Doc.PublicationName       string
Doc.PublicationPages      string
Doc.Citations             array of {href=>a, title=>b, rest=>c} hashes
Doc.Contributor           string
Doc.CiteSeerIdentifier    string
Doc.CiteSeerURL           string
Doc.Published             datetime
Doc.Source                string
Doc.DBLPIdentifier        string
Doc.CrossRef              string (BibTex crossref)
Doc.BibSource             string (BibTex source)
Doc.BibTexType            string (BibTex type: article, inbook, ...)
Doc.ACMCategories         array of strings
v-- ORIGINAL FIELDS USED --v
Doc.Title                 string
Doc.Subject               string
Doc.Author                string
Doc.PageCount             integer
Doc.WordCount             integer
Doc.Created               datetime

File.Software             string (software used to create the file)
File.MD5Sum               string (md5sum of file's contents)
File.SHA1Sum              string (sha1sum of file's contents)
v-- ORIGINAL FIELDS USED --v
File.Name                 string (basename of the file)
File.Path                 string (dirname of the file)
File.Format               string (mime type, inode/directory for dirs)
File.Size                 integer
File.Content              string
File.Modified             string

Image.DateCreated         date
Image.DateTimeCreated     date
Image.DateTimeOriginal    date
Image.DimensionUnit       string (px, mm, pt, ...)
Image.Editor              string
Image.EXIF                string (exiftool output)
Image.FrameCount          integer
Image.LayerCount          integer
Image.Modified            date
Image.OriginatingProgram  string
Image.ComponentCount      integer
Image.ColorMode           string (e.g. RGB)
Image.ColorSpace          string (e.g. sRGB)
v-- ORIGINAL FIELDS USED --v
Image.Height              float
Image.Width               float
Image.Title               string
Image.Date                datetime
Image.Creator             string
Image.Description         string
Image.Software            string
Image.CameraMake          string
Image.CameraModel         string
Image.ExposureProgram     string
Image.ExposureTime        float
Image.Fnumber             float
Image.Flash               boolean
Image.FocalLength         float
Image.ISOSpeed            float
Image.MeteringMode        string
Image.WhiteBalance        string
Image.Copyright           string

Location.Latitude         float
Location.Longitude        float

Video.Album               string
Video.Artist              string
Video.Bitrate             integer
Video.Codec               string
Video.Comment             string
Video.Duration            float
Video.Framerate           float (frames per second)
Video.Genre               string
Video.ReleaseDate         date
Video.Title               string
Video.TrackNo             integer
Video.Demuxer             string

BitTorrent.Name           string
BitTorrent.Files          array of { 'path' => string, 'length' => integer, 'md5sum' => string }
BitTorrent.Length         integer (size of single-file torrents)
BitTorrent.MD5Sum         string (md5sum for single-file torrents)
BitTorrent.PieceCount     integer
BitTorrent.PieceLength    integer (length of a single piece)
BitTorrent.Comment        string
BitTorrent.Announce       string (announce url)
BitTorrent.AnnounceList   array of arrays of strings
BitTorrent.Nodes          array of [hostname, port] -arrays

Appendix B: The MDH file format

MDH files are built as follows:

bytes | content

3   | "MDH"  - MDH file format identifier
1   | "\x01" - MDH file format version number
4   | Long, network byte order - the size of the metadata struct in bytes
var | YAML - The MDH metadata struct
var | The actual file contents

All string fields in the metadata are UTF-8.
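
To make the layout concrete, here is a rough sketch of writing and reading that structure in Ruby. It is not the code mdh itself uses: `mdh_pack' and `mdh_unpack' are hypothetical names, the size field is assumed to count the bytes of the YAML document, the metadata is assumed to hold only plain scalars, and the code targets a current Ruby rather than 1.8.

```ruby
require 'yaml'
require 'stringio'

MDH_MAGIC   = "MDH".b    # 3-byte file format identifier
MDH_VERSION = "\x01".b   # 1-byte file format version number

# Hypothetical writer: header + YAML metadata + original file contents.
def mdh_pack(metadata, contents)
  yaml = YAML.dump(metadata).b
  # "N" packs a 32-bit unsigned integer in network (big-endian) byte order.
  MDH_MAGIC + MDH_VERSION + [yaml.bytesize].pack("N") + yaml + contents.b
end

# Hypothetical reader: returns [metadata_hash, file_contents].
def mdh_unpack(io)
  header = io.read(8)
  raise ArgumentError, "truncated MDH header" unless header && header.bytesize == 8
  raise ArgumentError, "not an MDH file" unless header[0, 3] == MDH_MAGIC
  raise ArgumentError, "unsupported MDH version" unless header[3, 1] == MDH_VERSION

  size = header[4, 4].unpack("N").first
  yaml = io.read(size).to_s
  raise ArgumentError, "truncated MDH metadata" unless yaml.bytesize == size

  # Assumption: plain scalars only, so safe_load is sufficient here.
  metadata = YAML.safe_load(yaml.force_encoding("UTF-8"))
  [metadata, io.read]
end

blob = mdh_pack({ "File.Name" => "myfile.jpg", "File.Size" => 5 }, "hello")
metadata, contents = mdh_unpack(StringIO.new(blob))
```

Because the header carries the metadata size up front, a reader can skip straight to the file contents without parsing the YAML at all.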

License

Ruby's

Ilmari Heikkinen <ilmari.heikkinen gmail com>
