Do you know someone whose mission is advancing digital culture? Whose passion is building library systems? 😊
Or are you that person yourself? Then come and take part in further building the interfaces shared between culture and IT!
Documentary heritage is far more than just books. Working in a local history library, I come across a wide variety of items that help record
my father has a 2000+ item beer can collection [yes, the 'tism] and now I know how to catalogue them
WHAT IS THE MARC21 FORMAT? STRUCTURE AND DATA FORMAT FIELDS
MARC21 is a format that allows computers to represent, store, retrieve, and exchange bibliographic information, including related information, in machine-readable form.
->>>See details at: https://bit.ly/3eetV7l
--------------------------- Lạc Việt Vebrary Library Management Software 🏠 Address: 23 Nguyễn Thị Huỳnh, Ward 8, Phú Nhuận District, Ho Chi Minh City, Vietnam 🌐 Website: https://thuvien.lacviet.vn 📩 Email: [email protected] ☎️ Hotline: 0901 55 50 63
Parsing ISBD. part II, Contextualizing MARC Data
When I resumed blogging last year, I had aimed to post at least a couple of times per month. It was an ambitious goal and I did not succeed; hopefully this year will be better?
Anyway, please enjoy the following long overdue conclusion to our ISBD parsing discussion, rescued from my drafts. I wrote this many months ago, and have since been working on deepening my understanding of serials cataloguing, but I'm going to publish this post as is so that we can move on to other MARC topics next time!
Though the parser combinators we talked about in the last post are powerful, we sometimes need more context when making ISBD parsing decisions. Consider the following two examples:
Abstracts of Bulgarian scientific literature. Mathematics, physics, astronomy, geophysics, geodesy / Bulgarian Academy of Sciences, Centre for Scientific Information and Documentation.
MInd, the meetings index. Series SEMT, Science, engineering, medicine, technology.
The first is for one set of volumes (titled Mathematics, physics, astronomy, geophysics, geodesy) of a multipart monograph (Abstracts of Bulgarian scientific literature); the second is for a series titled Science, engineering, medicine, technology, designated by the series name Series SEMT, within the journal MInd, the meetings index.
The data up to the first period in both cases denotes the "common title" of each work, but it's what follows that's interesting. There are two possible ISBD patterns that can be applied here, based on the grammar alone:
Common title. Dependent title designation, Dependent title
Common title. Dependent title
As you can see, the commas in the dependent title of both examples make it ambiguous as to which way they should be parsed. Technically, there's no reason why "Mathematics" in the first title couldn't be parsed as a dependent title designation, even though we can tell from our understanding of English that that isn't correct. Our parser doesn't understand natural language, though; it needs some simpler way to decide which rule to apply.
There are two different ways to parse MARC title data, and as this example shows, neither on its own is the right way. A MARC 245 field can be parsed according to its subfields, or according to its ISBD grammar as we've been doing, but these parses are non-composable (the elements extracted by each parse do not always line up with each other), and kind of orthogonal to each other (each may capture something that the other misses).
As I've said before, context-sensitive parsing allows us to feed extra information to the parser to help "contextualize" its parsing decisions. In this case, we want to allow our ISBD parser to access and work with the subfields-based parse when parsing the whole title statement. (Note that this is very different from breaking a field into its subfields and then trying to analyze the grammar within each subfield; instead we're keeping the data intact and looking at the ISBD structure in parallel to the subfields structure.)
Going back to the dependent title designation (DTD) problem, we could use this idea to define a subparser as follows:
Look for a candidate DTD in the form of a string followed by a comma.
Consult the subfields parse to see if there are any subfield n values from our current parse position that match the candidate DTD.
If there are, update the parse position in both the ISBD and the subfields parse, and return a successful match. If there are no values in the MARC data that match our candidate, then return a failure to match, which will allow the parser to backtrack and try a different pattern.
With this logic, when the parser attempts to match "Mathematics" as a DTD, it will fail to find |nMathematics in the MARC data (because that string is part of subfield p), and will instead correctly use the second pattern above.
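To make the lookup described above a bit more concrete, here is a minimal, self-contained sketch in plain Kotlin. It is not the code from isbd-parser and does not use Autumn; the names Subfield, Position, normalize and matchDtd are invented for illustration, and the way the two sample titles are split into subfields is my assumption about how the records might be coded.

```kotlin
// One MARC subfield: its code ('a', 'n', 'p', ...) and its value.
data class Subfield(val code: Char, val value: String)

// Where the parse currently stands in the ISBD string and in the subfield list.
data class Position(val isbd: Int, val subfield: Int)

// Subfield values usually carry trailing ISBD punctuation; strip it before comparing.
fun normalize(value: String): String = value.trim().trimEnd(',', '.', '/', ' ')

// Returns the advanced position if the text up to the next comma is backed by
// a subfield n, or null so that the caller can backtrack to another pattern.
fun matchDtd(title: String, subfields: List<Subfield>, at: Position): Position? {
    // Candidate DTD: everything from the current position up to the next comma.
    val comma = title.indexOf(',', at.isbd)
    if (comma < 0) return null
    val candidate = title.substring(at.isbd, comma).trim()

    // Consult the subfields parse, starting from the current subfield position.
    val hit = subfields.withIndex()
        .drop(at.subfield)
        .firstOrNull { (_, subfield) -> subfield.code == 'n' && normalize(subfield.value) == candidate }
        ?: return null

    // Success: move both parse positions forward together.
    return Position(isbd = comma + 1, subfield = hit.index + 1)
}

// Assumed subfield coding for the two examples (illustrative only).
val mind = "MInd, the meetings index. Series SEMT, Science, engineering, medicine, technology."
val mindSubfields = listOf(
    Subfield('a', "MInd, the meetings index."),
    Subfield('n', "Series SEMT,"),
    Subfield('p', "Science, engineering, medicine, technology.")
)
val abstracts = "Abstracts of Bulgarian scientific literature. Mathematics, physics, astronomy, geophysics, geodesy"
val abstractsSubfields = listOf(
    Subfield('a', "Abstracts of Bulgarian scientific literature."),
    Subfield('p', "Mathematics, physics, astronomy, geophysics, geodesy /")
)
```

Calling matchDtd just after the common title succeeds for the MInd record, because "Series SEMT" is found in a subfield n, and returns null for the Bulgarian record, because "Mathematics" only occurs inside subfield p.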
Parsing ISBD. Interlude : Parser Combinators
I was going to pick up from the end of my previous post on ISBD, but I thought I should take a minute to talk about parser combinators first. There's lots on the internet already about what these are, so I’ll keep this specific to how they are used in the ISBD parsing library I mentioned last time.
Simply put, this approach to writing a parser consists of first writing smaller parsers focused on specific parts of the input, then combining them using higher-order “combinators” which encapsulate parsing logic like repetition, choosing between alternatives, etc. For example, isbd-parser defines several simple parsers that match exact sequences in the grammar, like " = ", " ; ", etc. Given the equalSign parser which matches " = " and a parser called data that matches any string not containing special grammar symbols, we could define a parser that matches parallel data as follows:
val parallelData: rule = seq(equalSign, data)
Here, seq is a combinator that produces a parser which runs the parsers given as input in sequence. (The rule type is a wrapper class around parsers in Autumn.) So, parallelData matches any input that begins with " = " followed by a string of data.
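If you have not seen combinators before, it may help to look at a hand-rolled version. The sketch below is plain Kotlin and deliberately does not use Autumn; the Parser interface, literal, and this toy data parser (with its short list of punctuation marks) are assumptions made for illustration, not the definitions used in isbd-parser.

```kotlin
// A parser reports the position just past its match, or null on failure.
fun interface Parser {
    fun parse(input: String, pos: Int): Int?
}

// Matches one exact piece of ISBD punctuation, such as " = ".
fun literal(text: String) = Parser { input, pos ->
    if (input.startsWith(text, pos)) pos + text.length else null
}

// Hypothetical stand-in for `data`: consumes at least one character and
// stops before the next punctuation mark from this (incomplete) list.
val marks = listOf(" = ", " : ", " / ", " ; ")
val data = Parser { input, pos ->
    val end = marks.map { input.indexOf(it, pos) }.filter { it >= 0 }.minOrNull() ?: input.length
    if (end > pos) end else null
}

// Runs the parsers one after another; fails as soon as one of them fails.
fun seq(vararg parsers: Parser) = Parser { input, pos ->
    var at: Int? = pos
    for (parser in parsers) {
        at = at?.let { parser.parse(input, it) }
    }
    at
}

val equalSign = literal(" = ")
val parallelData = seq(equalSign, data)
```

With these definitions, parallelData.parse(" = The National Library of Wales", 0) returns the length of the input, while an input that does not start with " = " yields null.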
In a similar way, we can build up a parser for parsing a full parallel title statement given smaller parsers for each of the components of the statement; simplifying a little, these would be the title proper, other title information, and statement of responsibility:
val otherInfo: rule = seq(colon, data)
val sor: rule = seq(slash, data)
val parallelTitleFull: rule = seq(
    parallelData,
    otherInfo.maybe(),
    sor.maybe()
)
Notice the .maybe() on the otherInfo and sor parsers; this is another combinator which produces a parser that matches zero or one occurrences of the input matched by the original parser, similar to ? in regular expressions. So parallelTitleFull matches an equal sign followed by a title proper, and optionally other title info and/or a SOR.
I mentioned last time the ambiguity with " = " in ISBD statements; one possible context in which you might find an equal sign is after a statement of responsibility, where it can either be just a parallel SOR[1]:
Bibliotheca Celtica : a register of publications relating to Wales and the Celtic peoples and languages / Llyfrgell Genedlaethol Cymru = The National Library of Wales
...or the start of a parallel version of the entire title statement:
Xiao xiao xiao xiao de huo / [Mei] Wu Qishi zhu = Little fires everywhere / Celeste Ng.
This is a case where the type of the data after the " = " depends on both what has already been parsed and what remains to be parsed. How do we handle this? Well, we've already got a parser for the full parallel title statement and here's a simple one for the parallel SOR:
val parallelSOR: rule = seq(equalSign, data)
In a statement of responsibility context, we want to apply exactly one of these parsers, depending on which one best matches the remaining input after the " = ". To do that, we can use the longest combinator:
val parallelSOROrTitle: rule = longest(parallelSOR, parallelTitleFull)
longest tries each parser in turn, and chooses the one that consumes the most input. Thus, if there is only a SOR, the parallel data will be parsed as a statement of responsibility; otherwise, it will be parsed as a parallel title statement. (Note that the SOR is technically a little more complicated, in an uninteresting way; the grammar has been simplified here for the sake of illustration.)
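For the curious, here is roughly how maybe and longest could be written by hand, and how they behave on the two examples above. This is a plain-Kotlin sketch that does not use Autumn; the Parser interface, the toy data parser, and the way longest simply keeps the furthest end position are assumptions for illustration, and Autumn's own implementation may differ in its details.

```kotlin
// A parser reports the position just past its match, or null on failure.
fun interface Parser {
    fun parse(input: String, pos: Int): Int?
}

fun literal(text: String) = Parser { input, pos ->
    if (input.startsWith(text, pos)) pos + text.length else null
}

// Hypothetical `data`: consumes text up to the next punctuation mark.
val marks = listOf(" = ", " : ", " / ", " ; ")
val data = Parser { input, pos ->
    val end = marks.map { input.indexOf(it, pos) }.filter { it >= 0 }.minOrNull() ?: input.length
    if (end > pos) end else null
}

fun seq(vararg parsers: Parser) = Parser { input, pos ->
    var at: Int? = pos
    for (parser in parsers) {
        at = at?.let { parser.parse(input, it) }
    }
    at
}

// Zero or one occurrence: on failure, succeed without consuming any input.
fun Parser.maybe() = Parser { input, pos -> this.parse(input, pos) ?: pos }

// Runs every alternative from the same position and keeps the furthest end.
fun longest(vararg parsers: Parser) = Parser { input, pos ->
    parsers.mapNotNull { it.parse(input, pos) }.maxOrNull()
}

val equalSign = literal(" = ")
val colon = literal(" : ")
val slash = literal(" / ")
val parallelData = seq(equalSign, data)
val otherInfo = seq(colon, data)
val sor = seq(slash, data)
val parallelSOR = seq(equalSign, data)
val parallelTitleFull = seq(parallelData, otherInfo.maybe(), sor.maybe())
val parallelSOROrTitle = longest(parallelSOR, parallelTitleFull)
```

On " = The National Library of Wales" both alternatives stop at the same place, so nothing is lost by reading it as a SOR. On " = Little fires everywhere / Celeste Ng." parallelSOR stops before the slash, whereas parallelTitleFull consumes the whole remainder, so it is the one that longest picks.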
[1] Since a title statement doesn't need to contain anything but a title proper, how do we know the string "The National Library of Wales" is actually a SOR and not a title? Mercifully, ISBD specifies that if there is only parallel title proper information, it must be recorded after the title proper itself, not after the entire title statement, so we can safely rule out that possibility here.
Parsing ISBD. part I
I’ve had a long-running draft about parsing ISBD that I keep editing and re-editing, unsure of how to start with this complicated subject. ISBD is an outmoded system of punctuation used to structure bibliographic data. An ISBD statement can often contain more information than MARC fields alone capture, such as parallel titles or granular contribution information for collected works. Parsing it accurately can improve the quality of the data that can be extracted from legacy MARC records. This is quite a challenge, though; the ISBD grammar is not technically regular, and is therefore impossible to parse reasonably using common tools based on regular expressions.
I carried a printout of one of the most difficult parts of the grammar -- the title statement punctuation -- in my notebook for the better part of last year, something to idly stare at when my brain needed a break from whatever less important problems I was working on at the time. A few months ago, I finally sat down and wrote out the core of a context-sensitive ISBD parser. The context-sensitivity is crucial: unlike regexps, or even context-free parsers, context-sensitive techniques allow you to share extra information with the parser to aid its parsing decisions, which turns out to be the key to handling ISBD’s ridiculous complexity.
What kinds of “extra information”? One case is for precisely understanding equal signs in the input. An equal sign is followed by parallel (i.e., translated) information of a type that depends on what data the parser has already seen. So, an equal sign before a colon (which indicates supplemental title info, like a subtitle) would mean that the data following the equal sign is a parallel title proper; while an equal sign after a colon (but before a slash!) could mean the data is a parallel subtitle, but could also mean the data is a parallel title that will be followed by a parallel subtitle; and so on. The context-sensitive approach would store information describing the history of the parse (“Was the last thing I saw a title, subtitle, statement of responsibility, etc.?”) and refer to this information when encountering an equal sign. The very nice thing about doing this in a framework like Autumn is that this stateful information is automatically kept accurate in light of things like backtracking. So you end up with quite a tidy solution for a complicated piece of parsing logic.
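As a rough illustration of the "history of the parse" idea, the sketch below walks a statement from left to right, remembers the last kind of element it saw, and lists the possible readings at each equal sign. It is plain Kotlin rather than Autumn, it ignores backtracking entirely, and the names Element, readingsAfterEqualSign and equalSignReadings are made up for this example; the real grammar has more cases than the three shown.

```kotlin
// The kinds of element the walk keeps track of (a simplification).
enum class Element { TITLE, OTHER_INFO, SOR }

// What the data after " = " may be, given the last element seen so far.
fun readingsAfterEqualSign(last: Element): List<String> = when (last) {
    Element.TITLE -> listOf("parallel title proper")
    Element.OTHER_INFO -> listOf(
        "parallel other title information",
        "parallel title followed by parallel other title information"
    )
    Element.SOR -> listOf(
        "parallel statement of responsibility",
        "full parallel title statement"
    )
}

// Walks the statement, updating the remembered element at " : " and " / ",
// and collecting the candidate readings at every " = ".
fun equalSignReadings(statement: String): List<List<String>> {
    var last = Element.TITLE
    val readings = mutableListOf<List<String>>()
    var i = 0
    while (i < statement.length) {
        when {
            statement.startsWith(" : ", i) -> { last = Element.OTHER_INFO; i += 3 }
            statement.startsWith(" / ", i) -> { last = Element.SOR; i += 3 }
            statement.startsWith(" = ", i) -> { readings.add(readingsAfterEqualSign(last)); i += 3 }
            else -> i++
        }
    }
    return readings
}
```

A real parser would then try each candidate reading and keep the one that fits the rest of the input, which is where the backtracking support matters.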
Another interesting case, which I think I’ll talk about next time, is when there isn’t enough information in the input itself to determine how it should be parsed, so the MARC data needs to be combined with the input to more accurately extract information.
Luke 18:24 How hard it is for those who have wealth to enter the kingdom of God. (ÚjFord 1990) Oh, how hard it is for those who have many possessions to enter the kingdom of God. (Sylvester János 1541) ---- If there is something truly important in your life, then it is so unnoticeably, but perhaps it can belong to complete faith. #egymondatos #marc21 https://www.instagram.com/p/B9-5siEBLBd/?igshid=fut41lmcqare