Tutorial¶
This tutorial is meant to give a quick overview of all the features of bibpy. You can also refer to the examples.
Table of Contents¶
Reading and writing¶
You can read reference data from a string or from a file.
>>> import bibpy
>>> result1 = bibpy.read_string('@article{key, title = {Title}, author = {Author}}')
>>> result2 = bibpy.read_file('references.bib')
Both functions return an bibpy.entries.Entries
class containing all
bibliographic entries (@conference
etc.), strings (@string
),
preambles (@preamble
), comment entries (@comment
) and finally
all comments in the source (which exist outside of entries).
By default, bibpy.read_string
and bibpy.read_file
read
data in the relaxed
mode (via the format
parameter) but
supports four different reference formats:
- bibtex: Treat data as bibtex. Raise error on non-conformity.
- biblatex: Treat data as biblatex. Raise error on non-conformity.
- mixed: Treat data as a mixture of bibtex and biblatex. Raise error on non-conformity.
- relaxed: Relax parsing rules. All types of entries and fields are allowed.
For example if you read a file containing an @online
entry as bibtex,
bibpy would raise an error since this entry type only exists in biblatex.
Therefore, the relaxed format is typically recommended when parsing third party
bib files.
Writing bib entries is straight-forward and you do not have to supply a reference format as the entries are simply written with the data they contain.
# Assume we still have the entries from the previous example loaded here
>>> bibpy.write_string(result1)
>>> bibpy.write_file('new-references.bib', result2.entries)
You can pass both the bibpy.entries.Entries
object or a list of
entries to the write functions. Both functions take a lot of formatting options
for e.g. alignment of the equal signs of fields in entries, sorting fields
alphabetically or using a user-defined, partial order. Try running the
formatting example
to see the effects of all the options.
Manipulating reference data¶
bibpy has been designed so manipulating reference data is easy. For this
section, we assume that we have loaded the following data into the variable
entries
:
@article{key1,
author = {James Conway and Archer Sterling},
title = {1337 Hacker},
year = {2010},
month = {4},
institution = {Office of Information Management {and} Communications},
message = {Hello!},
date = {2001-07-19/}
}
@online{key2,
author = {Hugh Morrison}
}
We can iterate the fields and values of the first entry.
for field, value in entries[0]:
print('{0} = {1}'.format(field, value))
All bibtex/biblatex fields are already accessible as properties of the entry
objects and the entries themselves support a range of sensible dict-like
operations. Entry fields that are not present in an entry return None
.
>>> entry = entries[0]
>>> entry.author
'James Conway and Archer Sterling'
>>> entry.year
'2010'
>>> entry.bibtype
'article'
>>> entries[1].bibkey
'key2'
>>> entry['month']
'4'
>>> entry['invalid']
None
>>> entry.message
Hello!
>>> entry.date
2001-07-19/
>>> entry.invalid
None
>>> entry.institution
'Office of Information Management {and} Communications'
>>> 'institution' in entry
True
>>> 'volume' in entry
False
>>> entry == entries[1]
False
>>> entry == entry
True
>>> entry != entries[1]
True
>>> entry.aliases('biblatex') # List of biblatex aliases for 'article'
[]
>>> entries[1].aliases('biblatex')
['electronic', 'www']
>>> entry.valid('biblatex') # Does the entry contain all required fields according to biblatex?
True
>>> entry.fields # Get a list of the active (non-None) fields of the entry
['author', 'title', 'year', 'month', 'institution', 'message']
>>> entry.extra_fields # Get a list of any additional non-bibtex/biblatex fields
['message']
>>> len(entry) # Number of active fields in the entry
6
>>> entry.keys() # Same as fields property (see below)
['author', 'title', 'year', 'month', 'institution', 'message']
>>> entry.values()
['James Conway and Archer Sterling', '1337 Hacker', '2010', '4', 'Office of Information Management {and} Communications']
>>> del entry['institution']
>>> entry.fields
['author', 'title', 'year', 'month', 'message']
>>> entry.clear() # Clear all fields (set to None)
Requirements Checking¶
Both bibtex and biblatex have requirements per entry that are usually not enforced but are needed for proper formatting and bibpy can also check this for you. Consider the entries below.
Only optional date missing
@article{key1,
author = {a},
title = {b},
journaltitle = {c},
year = {d}
}
Missing author field
@article{key4,
title = {b},
journaltitle = {c},
year = {d}
}
Is this valid biblatex?
>>> from bibpy.requirements import check
>>> entries = ... # Load entries
>>> check(entries[0], 'biblatex')
(set(), [])
>>> check(entries[1], 'biblatex')
(set(['author']), []),
The bibpy.requirements.check
function returns a 2-tuple. The first
element is a set of all missing required fields, the second element is a list
of sets of fields where only one of the fields are required. For example, some
bibtex entries need either an author
field or an editor
field.
No requirements are violated by the first entry since biblatex requires either
a year
or date
field and the former is provided.
Alternatively, you can call the bibpy.entry.entry.Entry.validate
method on an entry to validate an exisiting entry which throws a
bibpy.error.RequiredFieldError
if any violations are found.
>>> entries[1].validate('biblatex')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
bibpy.error.RequiredFieldError: Entry 'key4' (type 'article') is missing required field(s): author
The exception contains the offending entry and the required and optional fields
that would be returned from bibpy.requirements.check
. There is also
a bibpy.entry.entry.Entry.valid
method that returns True
or
False
instead of raising an exception.
Finally, bibpy.requirements.collect
finds and aggregates all
requirement violations for a list of entries, grouped by entry. Entries that
conform are not included in the result.
>>> from bibpy.requirements import collect
>>> collect(entries, 'bibtex')
[(entry, (<set of required fields>, [...])), ...]
Post- and Preprocessing Fields¶
You may have noticed in the Manipulating Reference Data section that values
are returned as strings by default. You can supply postprocess=True
to
the read_*
methods to convert a subset of the standard bibtex/biblatex
fields’ values to meaningful python types. Accessing the fields of the entries
from the previous section would now return the following instead.
>>> entries = bibpy.read_file('references.bib', 'biblatex', postprocess=True).entries
>>> entry = entries[0]
>>> entry.author
['James Conway', 'Archer Sterling'] # Author names have been split
>>> entry.year, type(entry.year) # Year is now an int
2010, <type 'int'>
>>> entry.month # Month has a proper name
'April'
>>> entry.institution # Institutions are split but not on '{and}'
['Office of Information Management and Communications']
>>> entry.date # Dates are converted to an object
bibpy.date.DateRange(2001-07-19/)
>>> entry.date.start
datetime.date(2001, 7, 19)
>>> entry.end
None
>>> entry.open // True if an open-ended date range
True
For name lists, ‘and’ is the default delimiter. bibpy does not split on
delimiters enclosed in braces, but removes them afterwards (the institution
field was not split on ‘and’ because it was braced). A biblatex date is
converted to a special bibpy.date.DateRange
object since they can
both refer to single dates and the time period between two dates. In this case,
it refers to an open-ended date (hence the ‘/’ at the end) starting on the 19th
of July 2001. When writing entries, its postprocessed fields are automatically
converted back to their pre-postprocessed counterparts.
If you need to postprocess fields manually (for example, you need to postprocess a subset of fields only when a condition is met), you can use the postprocessing functions directly.
from bibpy.postprocess import postprocess
entries = bibpy.read_file(...).entries
if condition:
for entry in entries:
# Postprocess the 'author' and 'date' fields if present
postprocess(entry, ['author', 'date'])
String Variable Expansion¶
Some reference files contain string variables like these:
@string{var1 = "Morrison"}
@string(var2 = "Harvard")
@article{key,
title = "Jake " # var1,
}
Each string entry contains a single variable name and a value for that
variable. By using bibpy.expand_strings
on the entries after
reading, the article entry will be as though it had been as follows in the file
instead.
@article{key,
title = "Jake Morrison"
}
Let’s try and load the entry interactively.
>>> result = bibpy.read_file('references.bib', 'mixed')
>>> entries, strings = result.entries, result.strings
>>> entries[0].title
'"Jake" # var1'
>>> bibpy.expand_strings(entries, strings) # Done in-place
>>> entries[0].title
"Jake Morrison"
>>> bibpy.unexpand_strings(entries, strings) # We can also revert the expansion
>>> entries[0].title
'"Jake" # var1'
We can also undo the string variable expansion using
bibpy.unexpand_strings
. Both functions raise errors if they find
duplicate variable names by default which would make unexpansion impossible for
entries that use the duplicates. The unexpansion might also unexpand unrelated
text that happens to be the same as that of a variable. There is currently no
way to avoid this.
Crossreferences and xdata Inheritance¶
There are three primary ways to do inheritance through fields:
crossref
, xdata
and xref
. The latter is not supported
as no data is actually directly inherited, it is just a non-inheriting
reference to another entry. Imagine we have the following two fields in a file.
@inbook{key1,
crossref = {key2},
title = {Title},
author = {Author},
pages = {5--25}
}
@book{key2,
subtitle = {Booksubtitle},
title = {Booktitle},
author = {Author2},
date = {1995},
publisher = {Publisher},
location = {Location}
}
Reading in the file with bibpy and then using
bibpy.inherit_crossrefs
, the inbook
entry can inherit the
appropriate fields from the book
entry (done in-place).
>>> results = bibpy.read_file('crossreferences.bib', 'relaxed')
>>> bibpy.inherit_crossrefs(results.entries)
Printing out the entries again shows that the title
and
subtitle
fields from the book
entry have been inherited (the
ordering of the fields may vary).
@inbook{key1,
crossref = {key2},
title = {Title},
booktitle = {Booktitle},
booksubtitle = {Booksubtitle},
author = {Author},
pages = {5--25}
}
@book{key2,
subtitle = {Booksubtitle},
title = {Booktitle},
author = {Author2},
date = {1995},
publisher = {Publisher},
location = {Location}
}
You can uninherit the fields again with bibpy.uninherit_crossrefs
.
You can also inherit and uninherit xdata
fields. The difference is that
while crossref
fields follow specific rules about which fields are
inherited and what their names become, xdata
simply pulls in the fields
from the ancestor and can optionally be made to overwrite existing fields with
the same names. If postprocess=True
when reading (see Post- and
Preprocessing Fields), xdata
fields are converted from a
comma-separated string to a list of keys.
bibpy Tools¶
bibpy comes with three command line tools which we discuss in turn.
bibformat¶
The bibformat tool can be used to align equal signs =
, expand string
variables and reorder fields. Run bibformat --help
for full details.
Below is an example of reordering fields (ordering the author
and
title
fields before other fields in all entries, the rest are
arbitrarily ordered), aligning equal signs and surrounding field values with
double-quotes instead of braces.
$ bibformat --order='author,title' --align --surround='""'
@article{key4,
author = "Archer Sterling",
title = "A Practical Guide To Getting Ants",
year = "1995",
month = "3"
}
bibstats¶
The bibstats tool displays statistics about bib entries. Run bibstats
--help
for full details. Below is an example of querying a bib source.
$ bibstats --count source1.bib
Found 4 entries
$ bibstats --top=3 source2.bib # Display the top 3 occurring entries
Entry Count
-----------------------------------------
article 881 (60.38%)
inproceedings 256 (17.55%)
techreport 113 (7.75%)
Total entries: 1459
bibgrep¶
The bibgrep tool is similar to the grep command but filters entries instead of lines.
$ bibgrep --entry="article" --field="author~hughes" --ignore-case
The above invocation selects entries that are either @article
entries
or have “hughes” (case-insensitive) somewhere in their author
field.
The approximation operator ~
also works with regular expressions.
$ bibgrep --field="author~M.+tt" some.bib
Alternatively, one can use =
to require exact matches. We can also
combine bibgrep
with the other tools. Here we also specify inclusive
ranges for years and a lower bound for volume fields.
$ bibgrep --entry="conference" | bibformat --indent=4 > conferences.json
$ bibgrep --field="year=1900-2000" --field="volume>=10" | bibstats --top=5
The first command selects all @conference
entries and bibformat indents
them by 4 spaces. The second command selects all entries that have a year field
in the inclusive range [1900; 2000] or a volume field of 10 or more, then
prints out the statistics for the top 5 occurring entries that satisfy those
predicates.
Selecting entries that satisfy all constraints can be done by piping multiple
invocations of bibgrep
.
$ bibgrep --entry="book" references.bib | bibgrep --field="month=1-3"
This selects all book
entries that were published in the first quarter
of any year.