Monday, May 30, 2016

Literary Doppelgängers and interestingness

I started this post with a few digital-humanities posturing paragraphs: if you want to read them, you'll encounter them eventually. But instead let me just get to the point: here's a trite new category of analysis that wouldn't be possible without distant reading techniques, and that sometimes produces charmingly serendipitous results.

I'll call it doppelgänger books. A doppelgänger is, for any world-historically great work of literature, a book that shares many of the same themes, subjects, and language, but is comparatively obscure, not widely read, and--most likely--of surpassingly mediocre quality.

Edit: Ryan Cordell informs me privately and regretfully that I'm wrong in some of my conclusions here. I said, "I took a grand total of one English literature class in college; does anyone expect me to be right?" But he's worried that my wrongness might reflect poorly on the field of DH, which has a history of critics straw-manning offhand blog posts into terrible representatives of the field. So let me say up front: Persons attempting to find an argument in this post will be prosecuted; persons attempting to find political advocacy in it will be banished; persons expecting me to have anything above a high-schooler's knowledge of English literature will be shot.

Take Huck Finn. In hazy recollection (I haven't read the whole book in probably 10 years), much of what seems great about it is the purely American picaresque of a vision of America. Twain's interest "is in the boy in whose mouth he puts the story, and in this boy's view of the world as it passes under his eye." Huck "is a true child of the river," and gives us a view of America seen through the eyes of "a perfect vagabond of a youngster, wandering up and down the river at his will, taking in the passing show with open mind, finding it all for to admire."

All those quotes, as you may already have guessed, are not describing Huck Finn at all, but instead come from a review of Charles Stewart's Partners of Providence (1904).


Read the book online through Hathi

The table of contents is pretty fascinatingly close to Huckleberry Finn; the reviewers note the comparison, and it's hard to imagine that the tale of a young boy's adventures up and down the river with an entertaining ethnic (here Irish) sidekick past swindlers and exhibitions and perils wasn't somehow modeled on the most famous humorist in the country.
But there are surely differences as well; I wouldn't be surprised if an in-class discussion of the racial politics of Huckleberry Finn could benefit from a brief comparison to Partners' account of "the marooning and subsequent escape of a pair of pugnacious darkies."

Across the c. 4.5 million public domain volumes in the Hathi Trust, there are a surprising number of these, many of them books that seem (based on Google searches) to languish in deserved obscurity. (I've got a set of tricks that actually make finding the pairings more feasible than running 20 trillion pairwise comparisons, but the exact mechanics of that are for another day.) But they're interesting; not in a "distant reading" way, but in that they provide some greater focus around the core texts we all read already.
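For a sense of scale, and of what the naive version of a vocabulary-based comparison looks like, here is a minimal sketch in Python. To be clear, this is not the engine behind this post (those tricks are still for another day): it is the brute-force baseline, and the function names and the two toy "books" are invented purely for illustration.

```python
import math
from collections import Counter

def pair_counts(n_volumes):
    """Return (ordered, unordered) counts of pairwise comparisons."""
    return n_volumes * n_volumes, n_volumes * (n_volumes - 1) // 2

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest(query, library):
    """Rank titles in `library` (title -> text) by similarity to `query`."""
    q = Counter(query.lower().split())
    scores = {title: cosine(q, Counter(text.lower().split()))
              for title, text in library.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy, made-up snippets standing in for full volumes.
toy_library = {
    "river book": "the boy drifted down the river on a raft",
    "whaling book": "the captain chased the whale across the sea",
}
ordered, unordered = pair_counts(4_500_000)
ranking = nearest("a boy on the river", toy_library)
```

Counting every ordered pair of 4.5 million volumes gives roughly 20 trillion comparisons; counting each unordered pair once halves that, which is still far too many to run one at a time.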


So let me just plug a few books in here and see what comes back. My criteria are just that the original book be canonical.

Huckleberry Finn

Twain is closest to himself; Huck Finn is closer to the later Tom Sawyer books than to Tom Sawyer itself, which should perhaps not be surprising.

But nearest-neighbor searching also reveals a deep vein of western boys' literature. We know that this exists; the interesting questions here would probably involve the specific ways (especially dialect: these are mostly first-person narratives in highly vernacular styles) that writers imitate Twain.

Publication years also provide a point of departure. All the books here were written substantially later than Huck Finn except for "Live boys in the Black Hills." So if I were going to pick any up, maybe I'd start there.


Moby-Dick

This has fewer straightforward imitators; but the whaling novel is a perfectly well-represented genre.
The closest match is the romance "The Red Eric; or, The whaler's last cruise. A tale" from 1883. Some elements of the contents are provocative, at least; but the similarities are less than perfect. (Red Eric's captain's "insane resolution" is to bring his daughter on a whaling cruise with him, for example).




A few other options include a collection of sea stories,


The Cruise of the Cachalot and Sea-wrack, by Frank Bullen, offer some of the more interesting comparisons. Properly shuffled, it makes sense that Moby Dick's closest companions might include not literature at all, but piecemeal miscellanea from the magazines like this ("Sea-Wrack")


Middlemarch

Middlemarch is somewhat harder to find close matches for uninteresting reasons: since the novel is so long, it was frequently chopped into 2, 3, or 4 parts; and each one of those sections ranks highly on the list.
The nearest novels are by Dinah Craik, who I don't know, but who seems well enough established as a poor man's George Eliot in the scholarly literature. (Googling quickly brought me to the online version of Sally Mitchell's monograph on the author.) "Hannah", the closest, is characterized by Mitchell as "a one-issue novel with a narrow legislative aim."

Fraternity; a romance ... (1910) is a harder nut to crack. It's a rural novel set in Wales and published by Macmillan around 1888, but the only surviving digital copy was (according to library metadata) published in the United States in 1910. (Galsworthy's 1911 novel Fraternity further muddies things here.) It's the subject of a strikingly positive review in the Boston press that explicitly casts it as a diamond in the rough.



I was going to let it go there, but then discovered a whole separate track via this book. The author is one Miss M. M. Holland Thomas, and the novel somehow attracted the intense admiration of JP Morgan for its message of social reform through benevolent patronage. (It is Morgan who paid for the American reprint in 1910.) Does this story have anything to do with a similarity to Middlemarch? Hmm. There's definitely something here about the connections between the English social novel and political intentions. But beyond that, I couldn't say.


The Education of Henry Adams

The absolute closest match is his brother's autobiography. Which should surprise no one, and I'm sure I've encountered the book before. "Early Memories" by Henry Cabot Lodge is also high on the list, which is probably a decent choice as well. But I'll pick as the doppelgänger Cambridge Sketches by Frank Preston Stearns, which hits a number of the same points.

The Souls of Black Folk

A real genre-bender of a book, even more than Moby Dick. And even less often reprinted.

The closest match is a fairly dull-seeming hagiography of Booker T. Washington. But I'll take as a shadow "Up stream: an American chronicle" by Ludwig Lewisohn. It seems to be the personal memoir of a German-born Jew who grew up in Charleston, SC before attending Columbia and (eventually) becoming a founding faculty member at Brandeis. The grounds for similarity aren't entirely clear--perhaps some odd combination of self-recognition, music, and the South?--but that's what makes it an interesting track. Some of the 


Autobiography of an ex-colored man

On the topic of great Af-Am literature. This one was suggested to me as a candidate by John Reuland. For this one I'm pasting in a longer list of matches, because we were initially very disappointed at the results. (Very little African American literature on the list).

But on looking at the list, what there is is an extraordinary amount of autobiographical self-help literature about money. So maybe there's some lesson to be gleaned there.


OK, that's enough.

Portrait of the Artist as a Young Man

Again, the matches aren't as clear; a vocabulary-based approach like mine works best with thematically distinct subjects like riverboats, not with "childhood."

There are some vaguely interesting similarities: at #3, I particularly like "What to read at Winter Entertainments," in which it appears the closest antecedent to Joyce is a stuffed-together hodgepodge of great British writers from the 19th century. Sounds about right.

But as a doppelgänger, I'll take Shaw Desmond's Gods, which seems to cover similar places in the Irish experience of the early 20th century.


On Interestingness

I've been thinking about Ted Underwood's "old-fashioned, shamelessly opinionated, 1000-word blog post" from yesterday. There are parts I wholeheartedly agree with, such as the section where he dances near to, but decorously avoids citing, Kieran Healy's magnum opus on what calls for nuance do in contemporary academic discourse. There are parts I don't; I'm increasingly convinced that efforts to apply and invent novel algorithmic practices should be fully central to the work of some humanists, and that calls to return to the primary questions of the disciplines are not just premature but somewhat misguided.*

(Roughly, although I should boil this up into a richer stew at some point: very few people outside a philosophy department think that only academic philosophers should do philosophy; very few people *inside* history departments think that only academic historians should do history. Just as we let political philosophy flourish in politics departments and cultural history flourish in art and music departments, computer programming shouldn't be the sole province of computer science departments.)

Is this interesting? I'm not sure. It's not here-I-come-PMLA interesting, for sure. But then again, I've never deliberately sought out much contemporary literary history written since 1980 or so. For a certain sort of Arnoldian prudish conception of literature, I kind of like the game. Much like my anachronism-searching blog posts, it's a field-and-context approach to literature where the whole is not treated as the object of study itself (the stated purpose of much "distant reading") but as a conveniently large wall on which to reposition the works of literature we're already interested in. What that means for literary history, I think I'm under no professional obligation to say.

Bonus links

A little bonus for those who read through to the end; a temporary link to a live interface to the engine I used for this thing, so you can play along at home. Just go to http://benschmidt.org/similarities/ and you can paste in any text you're interested in. Terms and conditions are: don't link to that page, because this may not scale; and e-mail me or post in the comments if you find any terrible bugs or interesting matches.

Tuesday, November 3, 2015

Word embedding models

A heads-up for those with this blog on their RSS feeds: I've just posted a couple things of potential interest on one of the two other blogs (errm) I'm running on my own site.

One, "Vector Space Models for the digital humanities," describes how a newly improved class of algorithms known as word embedding models work and showcases some of their potential applications for digital humanities researchers.

The other, "Rejecting the gender binary," is a more substantive look at how the method can help us better imagine a version of English without gendered language through some tricks of linear algebra; that results in a sort of translation dictionary between the way students talk about men and the way they talk about women.
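The "tricks of linear algebra" in that title presumably include vector rejection: subtracting from a word's vector whatever part of it lies along some direction. The post itself isn't reproduced here, so take the following as a minimal sketch of the operation rather than the method used there; the three-dimensional vectors are made up (real embeddings have hundreds of dimensions), and the names `reject` and `gender_direction` are mine.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def reject(vector, direction):
    """Remove from `vector` its component along `direction`."""
    scale = dot(vector, direction) / dot(direction, direction)
    return [x - scale * d for x, d in zip(vector, direction)]

# Hypothetical toy embeddings, invented for illustration only.
he = [1.0, 0.2, 0.0]
she = [-1.0, 0.2, 0.0]
gender_direction = [a - b for a, b in zip(he, she)]

word = [0.6, 0.5, 0.3]
neutral = reject(word, gender_direction)
```

After the rejection, `neutral` is orthogonal to `gender_direction`: whatever the word vector carried along the he/she axis is gone, and the other coordinates are untouched.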

I'm aware that this blog is sort of twisting on the vine right now. I like the politics of not using Google, and the ability to embed real javascript that comes with not using Blogger. Perhaps the humane thing to do would be to retire this site and direct you to http://benschmidt.org/posts/ and http://bookworm.benschmidt.org instead. But I like keeping it around, and will probably come back here next time I have something to say about, say, the hilariously inadequate college rankings the Economist just published, or just to link other stuff.

Monday, January 19, 2015

State of the Union--and corpus comparison.

Mitch Fraas and I have put together a two-part interactive for the Atlantic using Bookworm as a backend to look at the changing language in the State of the Union. Yoni Appelbaum, who just took over this week, spearheaded a great team over there including Chris Barna, Libby Bawcombe, Noah Gordon, Betsy Ebersole, and Jennie Rothenberg Gritz who took some of the Bookworm prototypes and built them into a navigable, attractive overall package. Thanks to everyone.

The first part is an interactive map with every place name we could find using the Stanford Natural Language Toolkit and some (Fraas-flavored) elbow grease. Then we got two great historians of American foreign policy, Dael Norwood and Gretchen Heefner, to explain some of the things in the maps.

The second is about individual words presidents use. So the recent rise in "Freedom," the references to the Constitution predominantly in the time of crisis, and so forth.

My favorite feature, and one that the Atlantic team executed beautifully, is the deep access into individual texts: click on a circle or a bar, and you are off reading the actual paragraph from the State of the Union that uses that word or mentions that place. This has always been a core feature of Bookworm on various levels--by treating paragraphs as documents for the modelling, it's easy to drill straight to the interesting stuff. One thing that's mostly missing is the Ngrams-style line chart. I've been saying for a while that I hope people see Bookworm enabling other forms of visualization. These pages are a great example of that; maps and bar charts of words are just as engaging, and sometimes things like "presidents" and "the world" are more engaging than individual years.

So go check those out. They speak for themselves.

But for the text analysis crowd, I also wanted to tell you a little more about the link down right at the bottom of the second (words) piece, and get a little technical about why that link, although we decided not to include it on the Atlantic site, contains the germ of something I find pretty interesting for online text analysis in general.

Tuesday, December 30, 2014

Federal College Rankings: The pitfalls of a magical regression model

Far and away the most interesting idea of the new government college ratings emerges toward the end of the report. It doesn't quite square the circle of competing constituencies for the rankings I worried about in my last post, but it gets close. Lots of weight is placed on a single magic model that will predict outcomes regardless of all the confounding factors they raise (differing pay by gender, sex, possibly even degree composition). As an inveterate modeler and data hound, I can see the appeal here. The federal government has far better data than US News and World Report, in the guise of the student loan repayment forms; this data will enable all sorts of useful studies on the effects of everything from home-schooling to early-marriage. I don't know that anyone is using it yet for the sort of studies it makes possible (do you?), but it sounds like they're opening the vault just for these college ranking purposes.

The challenges raised to the rankings in the report are formidable. Whether you think they can work depends on how much faith you have in the model. I think it's likely to be dicey for two reasons: it's hard to define "success" based on the data we have, and there are potentially disastrous downsides to the mix of variables that will be used as inputs.

Federal college rankings: who are they for?

Before the holiday, the Department of Education circulated a draft prospectus of the new college rankings they hope to release next year. That afternoon, I wrote a somewhat dyspeptic post on the way that these rankings, like all rankings, will inevitably be gamed. But it's probably better to bury that and instead point out a couple of looming problems with the system we may be working under soon. The first is that the audience for these rankings is unresolved in a very problematic way; the second is that altogether too much weight is placed on a regression model solving every objection that has been raised. Finally, I'll lay out my "constructive" solution for salvaging something out of this, which is that rather than use a three-tiered "excellent" - "adequate" - "needs improvement" system, everyone would be better served if we switched to a two-tiered "Good"/"Needs Improvement" system. Since this is sort of long, I'll break it up into three posts: the first is below.


Thursday, December 18, 2014

Administrative layers

Sometimes it takes time to make a data visualization, and sometimes they just fall out of the data practically by accident. Probably the most viewed thing I've ever made, of shipping lines as spaghetti strings, is one of the latter. I'm working to build one of the former for my talk at the American Historical Association out of the Newberry Library's remarkable Atlas of Historical County Boundaries. But my second ggplot with the set, which I originally did just to make sure the shapefiles were working, was actually interesting. So I thought I'd post it. Here's the graphic: then the explanation. Click to enlarge.



Tuesday, December 16, 2014

Fundamental plot arcs, seen through multidimensional analysis of thousands of TV and movie scripts

Note: a somewhat more complete and slightly less colloquial, but eminently more citeable, version of this work is in the Proceedings of the 2015 IEEE International Conference on Big Data. Plus, it was only there that I came around to calling the whole endeavor "plot arceology."

It's interesting to look, as I did in my last post, at the plot structure of typical episodes of a TV show as derived through topic models. But while it may help in understanding individual TV shows, the method also shows some promise on a more ambitious goal: understanding the general structural elements that most TV shows and movies draw from. TV and movie scripts are carefully crafted structures: I wrote earlier about how the Simpsons moves away from the school after its first few minutes, for example, and with this larger corpus even individual words frequently show a strong bias towards the front or end of scripts. This crafting shows up in the ways language is distributed through them in time.

So that's what I'm going to do here: make some general observations about the ways that scripts shift thematically. On its own, this stuff is pretty interesting--when I first started analyzing the set, I thought it might be an end in itself. But it turns out that by combining those thematic scripts with the topic models, it's possible to do something I find really fascinating, and a little mysterious: you can sketch out, derived from the tens of thousands of hours of dialogue in the corpus, what you could literally call a plot "arc" through multidimensional space.


Words in screen time

First, let's lay the groundwork. Many, many individual words show strong trends towards the beginning or end of scripts. In fact, plotting movies in what I'm calling "screen time" usually has a much more recognizable signature than plotting things in the "historic time" you can explore yourself in the movie bookworm. So what I've done is cut every script there into "twelfths" of a movie or TV show; the charts here show the course of an episode or movie from the first minute at the left to the last one at the right. For example: the phrase "love you" (as in, mostly, "I love you") is most frequent towards the end of movies or TV shows: characters in movies are almost three times more likely to profess their love in the last scene of a movie than in the first.
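If you want to try the "twelfths" idea on a script of your own, here is a rough sketch in Python. It assumes the script is just a list of words and slices it by word position rather than by actual screen minutes, which is a simplification; the function name and the toy script are invented for the example.

```python
def phrase_by_twelfth(tokens, phrase, parts=12):
    """Count occurrences of `phrase` (a sequence of words) in each of
    `parts` equal slices of a tokenized script."""
    counts = [0] * parts
    n = len(tokens)
    width = len(phrase)
    target = tuple(phrase)
    for i in range(n - width + 1):
        if tuple(tokens[i:i + width]) == target:
            # The slice is decided by where the phrase starts.
            counts[i * parts // n] += 1
    return counts

# Made-up script: filler dialogue, then a declaration at the very end.
toy_script = ("hello there " * 11 + "i love you").split()
love_counts = phrase_by_twelfth(toy_script, ["love", "you"])
```

Summing these counts across many scripts, and dividing by the number of words in each twelfth, would give the kind of front-to-back frequency curve described above.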

Thursday, December 11, 2014

Typical TV episodes: visualizing topics in screen time

The most interesting element of the Bookworm browser for movies I wrote about in my last post here is the possibility to delve into the episodic structure of different TV shows by dividing them up by minutes. On my website, I previously wrote about story structures in the Simpsons and a topic model of movies I made using the general-purpose bookworm topic modeling extension. For a description of the corpus or of topic modeling, see those links.

Note: Part II of this series, which goes into quantifying the fundamental shared elements of plot arcs, is now up here.

In this post, I'm going to combine those two projects. What can we see by looking at the different content of TV shows? Are there elements to the ways that TV shows are laid out--common plot structures--that repeat? How thematically different is the end of a show from its beginning? I want to take a first stab at those questions by looking at a couple hundred TV shows and their structure. To do that, I:

1. Divided a corpus of 80,000 movies and TV show episodes into 3 minute chunks, and then divided each show into 12 roughly-equal parts.
2. Generated a 128-topic model where each document is one of those 3-minute chunks, which should help the topics be better geared to what's on screen at any given time.
3. For every TV show, plotted the distribution of the ten most common topics with the y-axis roughly representing percent of dialogue of the show in the topic, and the x-axis corresponding to the twelfth of the show it happened in. So dialogue in minute 55 of a 60-minute show will be in chunk 11.
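The bookkeeping in those three steps can be sketched roughly as follows. This is only an illustration of the chunk-to-twelfth mapping, not the Bookworm code: it assumes twelfths are numbered from 0, so that minute 55 of a 60-minute show lands in chunk 11, and it takes the per-chunk topic word counts as given, since fitting the topic model itself is out of scope. The numbers in the example are invented.

```python
from collections import defaultdict

def chunk_index(minute, chunk_minutes=3):
    """Zero-based index of the fixed-length chunk containing `minute`."""
    return int(minute // chunk_minutes)

def twelfth_index(minute, runtime_minutes, parts=12):
    """Zero-based twelfth of the show containing `minute`."""
    return min(int(minute * parts // runtime_minutes), parts - 1)

def topic_share_by_twelfth(chunks, runtime_minutes, parts=12):
    """chunks: list of (start_minute, {topic: word_count}).
    Returns {twelfth: {topic: share of that twelfth's words}}."""
    totals = defaultdict(lambda: defaultdict(float))
    for start, topic_counts in chunks:
        part = twelfth_index(start, runtime_minutes, parts)
        for topic, count in topic_counts.items():
            totals[part][topic] += count
    shares = {}
    for part, topic_counts in totals.items():
        total = sum(topic_counts.values())
        if total:
            shares[part] = {t: c / total for t, c in topic_counts.items()}
    return shares

# Hypothetical word counts for two 3-minute chunks of a 60-minute episode.
toy_chunks = [(0, {"murder": 30, "court": 10}), (57, {"court": 40})]
toy_shares = topic_share_by_twelfth(toy_chunks, runtime_minutes=60)
```

Note that these shares are normalized within each twelfth; the charts here scale things by the show's dialogue overall, so the two aren't directly comparable.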

First a note: these images seem not to display in some browsers. If you want to zoom and can't read the legends, right click and select "view in a new tab."

Let's start by looking at a particularly formulaic show: Law and Order.





The two most common topics in Law & Order are "court case Mr. trial lawyer" and "murder body blood case". Murder is strongest in the first twelfth, when the body is discovered; "court case" doesn't appear in any strength until almost halfway through, after which it grows until it takes up more than half the space by the last twelfth.

That's pretty good straight off: the process accurately captures the central structuring element of the show, which is the handoff from cops to lawyers at the 30 minute mark. (Or really, this suggests, more like the 25 minute mark). Most of the other topics are relatively constant. (It's interesting that the gun topic is constant, actually, but that's another matter). But a few change--we also get a decrease in the topic "people kid kids talk," capturing some element of the interview process by the cops; a different conversation topic, "talk help take problem," is more associated with the lawyers. Also, the total curve is wider at the end than at the beginning; that's because we're not looking at all the words in Law & Order, just the top ten out of 127 topics. We could infer, preliminarily, that Law & Order is more thematically coherent in the last half hour than the first one: there's a lot of thematic diversity as the detectives roam around New York, but the courtroom half is always the same.

Compare the spinoffs: SVU is almost identical to the Law & Order mothership, but Criminal Intent gets to the courtroom much later and with less intensity.






See below the fold for more. Be warned: I've put a whole bunch of images into this one.

Monday, September 15, 2014

Screen time!

Here's a very fun, and for some purposes, perhaps, a very useful thing: a Bookworm browser that lets you investigate onscreen language in about 87,000 movies and TV shows, encompassing together over 600 million words. (Go follow that link if you want to investigate yourself).

I've been thinking about doing this for years, but some of the interest in my recent Simpsons browser and some leaps and bounds in the Bookworm platform have spurred me to finally lay it out. This comes from a very large collection of closed captions/subtitles from the website opensubtitles.org; thanks very much to them for providing a bulk download.

Just as a set of line charts, this provides a nice window into changing language. I've been interested in the "need to"/"ought to" shift since I wrote about it in Mad Men: it's quite clear in the subtitle corpus, and the ratio is much higher as of 2014 than anything Ngrams can show.


Thursday, September 11, 2014

Some links to myself

An FYI, mostly for people following this feed on RSS: I just put up on my home web site a post about applications for the Simpsons Bookworm browser I made. It touches on a bunch of stuff that would usually lead me to post it here. (Really, it hits the Sapping Attention trifecta: a discussion of the best ways of visualizing Dunning Log-Likelihood; cryptic allusions to critical theory; and overly serious discussions of popular TV shows.) But it's even less proofread and edited than what I usually put here, and I've lately been more and more reluctant to post things on a Google site like this, particularly as Blogger gets folded more and more into Google Plus. That's one of the big reasons I don't post here as much as I used to, honestly. (Another is that I don't want to worry about embedded javascript). So, head over there if you want to read it.

While I'm at it, I made a few data visualizations last year that I only shared on Twitter, but meant to link to from here: Those are linked from a single place on my web site. My favorite is the baseball leaderboard, the most popular was either the distorted subway maps or the career charts, and the most useful, I think, is the browser of college degrees by school and institution type. There are a couple others as well. (And there are a few not there that I'll add at some point.)

Wednesday, August 13, 2014

Data visualization rules, 1915

Right now people in data visualization tend to be interested in their field’s history, and people in digital humanities tend to be fascinated by data visualization. Doing some research in the National Archives in Washington this summer, I came across an early set of rules for graphic presentation by the Bureau of the Census from February 1915. Given those interests, I thought I’d put that list online.

As you may know, the census bureau is probably the single most important organization for inculcating visual-statistical literacy in the American public, particularly through the institution of the Statistical Atlas of the United States published in various forms between 1870 and 1920.
A page from the 1890 Census Atlas: Library of Congress

Friday, May 23, 2014

Mind the gap: Incomes, college majors, gender, and higher ed reform

People love to talk about how "practical" different college majors are: and practicality is usually measured in dollars. But those measurements can be very problematic, in ways that might have bad implications for higher education. That's what this post is about.

I'll start with a paradox that anyone who talks to young people about their college majors should understand.

Let's say you're going to college to maximize your future earnings. You've read the census report that says your choice of major can make millions of dollars of difference, so you want to pick the right one. In the end, you're deciding between majoring in finance or nursing. Which one makes you more money?

Correction, 5/24/14: I've just realized I made an error in assigning weights that meant the numbers I gave originally in this post were for heads of household only, not all workers. I'm fixing the text with strikethroughs, because that's what people seem to do, and adding new charts while shrinking the originals down dramatically. None of the conclusions are changed by the mistake.

The obvious thing to do is look at median incomes in each field. Limiting to those who work 30hrs a week and are between 30 and 45 years old, you'd get these results. (Which is just the sort of thing that census report tells you).

Original version
Same chart, all workers
Nursing majors make a median of $65,000 (originally $69,000); finance majors make $70,000 (originally $78,000).

That means you'll make 13% more as a finance major, right?

Wrong. This is pretty close, instead, to a straightforward case of Simpson's Paradox.* Even though the average finance major makes more than the average nursing major, the average individual will make more in nursing. Read that sentence again: it's bizarre, but true.

How can it be true? Because any individual has to be male or female. (Fine, not really: but for the purposes of government datasets like this, you have to choose one). And when you break down pay by gender, something strange happens:

Original version, head of household only


Male nurses do indeed make less than male finance majors ($72,000 vs $76,000 in median income; originally $85,000 vs $80,000).
But that's more than offset by the fact that female nurses make much more than their finance counterparts ($64,000 vs $57,000; originally $67,699.78** vs $61,000). The average person will actually make more with a nursing degree than with a finance degree.

So why the difference? Because there are hardly any men who major in nursing, and hardly any women who major in finance, so the median income ends up being about male wages for finance, and about female wages for nursing.
Original version, heads of household only.

The apparent gulf between finance and nursing has nothing to do with the actual majors, and everything to do with the pervasive gender gaps across the American economy.
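To see the mechanics, here is a toy version in Python. The four median incomes are one set of the figures quoted above; the head counts are invented (not from the ACS), and for simplicity everyone is pretended to earn exactly their group's median. The "balanced" figure is a crude stand-in for equal representation: just the mean of the two gender medians.

```python
from statistics import median

# One set of the median incomes quoted in the post, in dollars.
MEDIANS = {
    "nursing": {"men": 72_000, "women": 64_000},
    "finance": {"men": 76_000, "women": 57_000},
}
# Head counts per 100 majors: invented for illustration only.
HEADCOUNT = {
    "nursing": {"men": 10, "women": 90},
    "finance": {"men": 70, "women": 30},
}

def pooled_median(major):
    """Median over everyone in the major, pretending each person
    earns exactly their group's median income."""
    incomes = [MEDIANS[major][g]
               for g in ("men", "women")
               for _ in range(HEADCOUNT[major][g])]
    return median(incomes)

def balanced_income(major):
    """Crude 'equal representation' figure: mean of the two gender medians."""
    return (MEDIANS[major]["men"] + MEDIANS[major]["women"]) / 2

pooled = {m: pooled_median(m) for m in MEDIANS}
balanced = {m: balanced_income(m) for m in MEDIANS}
```

With these made-up head counts, the pooled median for finance is simply the male figure and the pooled median for nursing is simply the female one, so finance looks far ahead; weight the genders equally and nursing comes out on top.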

Like many examples of Simpson's paradox, this has some real-world implications. There's a real push (that census report is just one example) to think of college majors more vocationally. Charts of income by major are omnipresent. There's even a real danger that universities will get some federal regulation using loan repayment rates, which won't be independent of income, to determine what colleges are doing a "good" or "bad" job.

Every newspaper chart or college loan program that doesn't disaggregate by gender is going to make the majors that women choose look worse than the ones that men choose. Think we need more people to major in computer science, engineering, and economics? Think we need fewer sociology, English, and Liberal Arts majors? That's not just saying that high-paying fields are better: it's also saying that the sort of fields women major in more often are less worthwhile.

How important is gender? Very. A male English major probably makes more than makes the same as a female math major, and a female economics major makes less than a male history major. So the next time you see someone arguing that only fools major in art history, remind them that the real thing holding back most English majors in the workplace isn't their degree but systemic discrimination against their sex in the American economy.***

By the way: you might be thinking, "That's great: the ACS includes major, now we have some real evidence." You shouldn't. Data collection isn't apolitical. The reason that the ACS includes major is because the state has turned its gaze to college major as a conceivable area of government regulation. We're going to get a lot of thoughtlessly anti-humanities findings out of this set: For example, that census department report grandly concluded that people who major in the humanities are much less likely to find full-time, year-round employment, while burying in a footnote that schoolteachers--the top or second-most common job for most humanities majors--don't count as year-round employees because they take the summer off.**** So, brace yourself. One of the big red herrings will be focusing on earnings for 23-year-olds; this ignores both the fact that law (which you can't start until age 26) is a common and lucrative destination for humanities majors, and the fact that liberal arts majors catch up, since their skills (to speak of it instrumentally) don't atrophy as quickly. Not to mention all the non-pecuniary rewards.*****

So one of the big challenges over the next few years for advocates of fields that include a lot of women (which includes psychology, education, and communication, as well as many of the humanities) is going to be sussing out the implication of the gender gap for proposed policies and regulations. A perfectly crafted higher ed policy would, of course, take this into account: but it's extremely unlikely that we'll get one of those, if indeed we need one. It would be a bitterly ironic outcome if attempts to fix college majors ended up rewarding fields like computer science for becoming systematically less friendly to women over the last few decades.

This isn't to say there aren't real effects: pharmacology and electrical engineering majors do make more money, certainly, than arts or communications majors. But while the gender disparity is a massive, critical element to every discussion of wages, it's not the only thing lurking behind these numbers. (I've only imperfectly adjusted for age, for example).

So I'm reluctant to give the average incomes at all, since I suspect that even with gender factored out they might confuse us. Still, it's worth thinking about. So here they are: a chart of the most common majors, showing median income for men and women: the arrows show the shift from the actual median income to what it would be if both genders were equally represented.

Original version: heads of household only.



*Actually, this isn't a perfect case of Simpson's paradox, because the male rate is indeed lower; there's a third variable at play here, the size of the gender gap within each field: although it's everywhere, the gender gap isn't necessarily the same size.

**Median incomes usually come out as round numbers, because most people report it approximately; but sometimes, as here, they don't.

***I don't actually recommend you do precisely that, from a lobbying perspective. 

**** That's why I've made the somewhat questionable choice of not reducing the set down to "full time year round" workers as is conventional: instead, I'm using the weaker filter of persons under 60 with a college degree who worked at least 30 hours a week. 

***** Which, yes, I believe are more important than the few thousand dollars you might get by agreeing to sell pharmaceuticals the rest of your life. But it's critically important not to just cede the field on less exalted measures of success.